Matt Edwards
Tue, 2 Mar 1999

(Please excuse if this has already been covered)

HtDig 3.1.1 isn't parsing (slightly non-standard) comments correctly.

Extra dashes in the comment can confuse the current parser into
ignoring a lot of content. For example, <!--comment----> is seen as
the start of a comment that is never closed.

It seems a lot of web content doesn't strictly adhere to the
"standard" for comments, so we should be a little careful here.

For example, both IE and Netscape require a "<!--" comment to end
with "-->", with no whitespace between the "--" and the ">".
Perhaps HtDig would be better off doing the same.


[modified snip from HTML.cc]
      if (strncmp((char *)position, "<!", 2) == 0)
      {
          // Possible comment declaration (but could be DTD declaration!)
          // A comment can contain other '<' and '>':
          // we have to ignore complete comment declarations
          // but of course also DTD declarations.
          position += 2; // Get past declaration start

          // is it a comment?
          if (strncmp((char *)position, "--", 2) == 0)
          {
              // Found start of comment - now find the end
              q = (unsigned char*)strstr((char *)position, "-->");
              if (!q)
                  // Rest of document seems to be a comment...
                  *position = '\0';
              else
                  position = q + 3;
          }
          else
          {
              // Not a comment declaration after all
              // but possibly DTD: get to the end
              q = (unsigned char*)strstr((char *)position, ">");
              if (q)
                  // End of (whatever) declaration
                  position = q + 1;
              else
                  *position = '\0'; // Rest of document is DTD?
          }
      }
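
To try the browser-style rule in isolation, here is a minimal
standalone sketch. The function name skip_declaration is made up for
illustration and is not part of HTML.cc; it returns a pointer instead
of modifying the buffer, so it only approximates the snippet above.

```cpp
#include <cstring>

// Hypothetical helper (not in HTML.cc): 'p' points at "<!".
// Returns a pointer just past the declaration, or nullptr when the
// declaration is never closed (the caller would then drop the rest
// of the document).
const char *skip_declaration(const char *p)
{
    if (std::strncmp(p, "<!", 2) != 0)
        return p;                        // not a declaration at all

    p += 2;                              // get past "<!"

    if (std::strncmp(p, "--", 2) == 0) {
        // Comment: the end is the literal "-->", nothing else counts.
        const char *end = std::strstr(p, "-->");
        return end ? end + 3 : nullptr;
    }

    // Some other declaration (e.g. DOCTYPE): ends at the next '>'.
    const char *end = std::strchr(p, '>');
    return end ? end + 1 : nullptr;
}
```

With this rule, "<!--comment---->rest" is skipped up to "rest", while
"<!-- a -- >b" (whitespace before the ">") counts as unterminated.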

According to Marjolein Katsma:
> Starting on my next project, I had to dig in HTML.cc, and found the
> following code to filter out comments:

According to Gilles Detillieux:
> While this will catch *most* comments, it will see some perfectly legal
> comments as illegal and skip the rest of the page. The best definition
> of comments is found in HTML 2.0 (unchanged in the actual DTD in later
> versions, but never properly explained any more...):
> "To include comments in an HTML document, use a comment declaration. A
> comment declaration consists of `<!' followed by zero or more comments
> followed by `>'. Each comment starts with `--' and includes all text up
> to and including the next occurrence of `--'. In a comment declaration,
> white space is allowed after each comment, but not before the first
> comment. The entire comment declaration is ignored."
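
For comparison, the strict rule quoted above can be sketched as
follows. This is only an illustration of the HTML 2.0 wording, not
code from HtDig, and the function name is made up.

```cpp
#include <cctype>
#include <cstring>

// Hypothetical strict scanner following the HTML 2.0 wording:
// "<!" + zero or more comments + ">", where each comment runs from
// "--" to the next "--" and may be followed by white space.
// Returns a pointer just past the closing '>', or nullptr if the text
// at 'p' is not a complete comment declaration.
const char *skip_strict_comment_decl(const char *p)
{
    if (std::strncmp(p, "<!", 2) != 0)
        return nullptr;
    p += 2;

    while (std::strncmp(p, "--", 2) == 0) {
        // Find the "--" that closes this comment.
        const char *close = std::strstr(p + 2, "--");
        if (!close)
            return nullptr;              // comment never closed
        p = close + 2;
        // White space is allowed after each comment.
        while (std::isspace(static_cast<unsigned char>(*p)))
            ++p;
    }

    return *p == '>' ? p + 1 : nullptr;
}
```

Read this way, "<!-- a -- >b" is a complete declaration, but in
"<!--comment---->" the extra "--" opens a second comment that never
closes, which is why a strict parser ends up ignoring what follows.
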
Matthew Edwards
Go2Net Inc. 999 Third Ave Suite 4700 | progress is freedom.

