Matt Edwards (email@example.com)
Tue, 2 Mar 1999 12:04:39 -0800 (PST)
(Please excuse if this has already been covered)
HtDig 3.1.1 isn't parsing (slightly non-standard) comments correctly.
Extra dashes in the comment can confuse the current parser into
ignoring a lot of content. For example <!--comment----> is seen as
an uncompleted comment beginning.
It seems a lot of web content doesn't strictly adhere to the
"standard" for comments, so we should be a little careful here.
For example both IE and Netscape require "<!--" comments to end
with a "-->" without whitespace between the "--" and the ">".
Perhaps htDig would be better off doing the same.
[modified snip from HTML.cc]
    if (strncmp((char *)position, "<!", 2) == 0)
    {
        // Possible comment declaration (but could be a DTD declaration!)
        // A comment can contain other '<' and '>':
        // we have to ignore complete comment declarations,
        // but of course also DTD declarations.
        position += 2; // Get past declaration start
        // Is it a comment?
        if (strncmp((char *)position, "--", 2) == 0)
        {
            // Found start of comment - now find the end
            q = (unsigned char*)strstr((char *)position, "-->");
            if (!q)
            {
                // Rest of document seems to be a comment...
                *position = '\0';
            }
            else
                position = q + 3;
        }
        else
        {
            // Not a comment declaration after all
            // but possibly DTD: get to the end
            q = (unsigned char*)strstr((char *)position, ">");
            if (q)
                position = q + 1; // End of (whatever) declaration
            else
                *position = '\0'; // Rest of document is DTD?
        }
    }
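For anyone who wants to try the idea outside of HTML.cc, here is a minimal standalone sketch of the same "end at the first -->" rule. The function name skip_comment_lenient is made up for illustration and is not part of htDig. Like the snip above, it starts searching right after the "<!", so a bare "<!-->" counts as a complete comment.

```cpp
#include <cstring>

// Hypothetical helper, not part of HTML.cc.
// If p starts with "<!--", return a pointer just past the first "-->"
// that follows the "<!", or nullptr if no "-->" is found.
// If p does not start with "<!--", return p unchanged.
const char *skip_comment_lenient(const char *p)
{
    if (std::strncmp(p, "<!--", 4) != 0)
        return p;                       // not a comment: nothing to skip

    // Search from just after "<!" (as the snip above does), so extra
    // dashes before the closing "-->" do not matter.
    const char *end = std::strstr(p + 2, "-->");
    return end ? end + 3 : nullptr;     // nullptr: comment never ends
}
```

Called on "<!--comment---->rest" this should return a pointer to "rest", since the extra dashes are simply swallowed as comment text.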
According to Marjolein Katsma:
> Starting on my next project, I had to dig in HTML.cc, and found the
> following code to filter out comments:
According to Gilles Detilleux:
> While this will catch *most* comments, it will see some perfectly legal
> comments as illegal and skip the rest of the page. The best definition
> of comments is found in HTML 2.0 (unchanged in the actual DTD in later
> versions, but never properly explained any more...):
> "To include comments in an HTML document, use a comment declaration. A
> comment declaration consists of `<!' followed by zero or more comments
> followed by `>'. Each comment starts with `--' and includes all text up
> to and including the next occurrence of `--'. In a comment declaration,
> white space is allowed after each comment, but not before the first
> comment. The entire comment declaration is ignored."
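To make the quoted definition concrete, here is a rough sketch of what a strict reading of it might look like. This is only an illustration: skip_comment_strict is a made-up name, not anything in htDig, and it handles comment declarations only (a DTD declaration such as <!DOCTYPE ...> is simply reported as "not a comment declaration").

```cpp
#include <cctype>
#include <cstring>

// Hypothetical helper, not part of HTML.cc.
// Strict reading of the HTML 2.0 wording: "<!", then zero or more
// comments ("--" ... "--"), each optionally followed by white space,
// then ">". Returns a pointer just past the closing '>', or nullptr
// if p is not a well-formed, terminated comment declaration.
const char *skip_comment_strict(const char *p)
{
    if (std::strncmp(p, "<!", 2) != 0)
        return nullptr;
    p += 2;                                 // get past declaration start

    for (;;) {
        if (*p == '>')
            return p + 1;                   // end of the declaration
        if (std::strncmp(p, "--", 2) != 0)
            return nullptr;                 // neither a comment nor '>'

        // A comment runs up to and including the next "--".
        const char *close = std::strstr(p + 2, "--");
        if (!close)
            return nullptr;                 // comment never closes
        p = close + 2;

        // White space is allowed after each comment.
        while (std::isspace(static_cast<unsigned char>(*p)))
            ++p;
    }
}
```

Under this reading, "<!--comment---->" comes back as unterminated: the first comment is "--comment--", the next "--" opens a second comment, and the ">" ends up inside it. That is the strictness that trips over sloppy real-world pages.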
Matthew Edwards (firstname.lastname@example.org) | The fuel of innovation and
Go2Net Inc. 999 Third Ave Suite 4700 | progress is freedom.