[htdig] Re: htdig: Comments


Matt Edwards (medwards@go2net.com)
Tue, 2 Mar 1999 12:04:39 -0800 (PST)


(Please excuse me if this has already been covered.)

HtDig 3.1.1 isn't parsing (slightly non-standard) comments correctly.

Extra dashes in the comment can confuse the current parser into
ignoring a lot of content. For example, <!--comment----> is seen as
the start of a comment that is never closed.

It seems a lot of web content doesn't strictly adhere to the
"standard" for comments, so we should be a little careful here.

For example, both IE and Netscape require "<!--" comments to end
with a "-->", with no whitespace between the "--" and the ">".
Perhaps HtDig would be better off doing the same.

i.e.:

[modified snip from HTML.cc]
      if (strncmp((char *)position, "<!", 2) == 0)
        {
          //
          // Possible comment declaration (but could be a DTD declaration!)
          // A comment can contain other '<' and '>' characters, so we
          // have to skip the complete comment declaration, and of
          // course DTD declarations as well.
          //
          position += 2; // Get past declaration start

          // is it a comment?
          if (strncmp((char *)position, "--", 2) == 0)
            {
              // Found start of comment - now find the end
              q = (unsigned char*)strstr((char *)position, "-->");
              if (!q)
                {
                  // Rest of document seems to be a comment...
                  *position = '\0';
                }
              else
                {
                  position = q + 3;
                }
            }
          else
            {
              // Not a comment declaration after all
              // but possibly DTD: get to the end
              q = (unsigned char*)strstr((char *)position, ">");
              if (q)
                {
                  position = q + 1;
                  // End of (whatever) declaration
                }
              else
                {
                  *position = '\0'; // Rest of document is DTD?
                }
            }
          continue;
        }
[snip]
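
For anyone who wants to try the rule outside of HTML.cc, here is a
minimal, self-contained sketch of the same logic. The function name
skip_declaration is my own invention for illustration and is not part
of the htdig source; it assumes it is handed a pointer that sits on
the "<!" and returns a pointer instead of modifying the buffer.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper (not part of the htdig source).
// Given a pointer positioned on "<!", return a pointer just past the
// declaration, or to the terminating '\0' if the declaration never ends.
const char *skip_declaration(const char *p)
{
    assert(p != nullptr && std::strncmp(p, "<!", 2) == 0);
    p += 2;                                   // get past "<!"

    // Browser-style rule: a comment ends at the first "-->";
    // any other declaration (e.g. a DTD) ends at the first '>'.
    const bool is_comment = std::strncmp(p, "--", 2) == 0;
    const char *terminator = is_comment ? "-->" : ">";

    const char *end = std::strstr(p, terminator);
    if (!end)
        return p + std::strlen(p);            // rest of document is swallowed
    return end + std::strlen(terminator);     // resume parsing after it
}
```

With this rule, skip_declaration("<!--comment---->text") returns a
pointer to "text", so the extra dashes no longer hide the rest of the
page.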

According to Marjolein Katsma:
> Starting on my next project, I had to dig in HTML.cc, and found the
> following code to filter out comments:

According to Gilles Detillieux:
> While this will catch *most* comments, it will see some perfectly legal
> comments as illegal and skip the rest of the page. The best definition
> of comments is found in HTML 2.0 (unchanged in the actual DTD in later
> versions, but never properly explained any more...):
>
> "To include comments in an HTML document, use a comment declaration. A
> comment declaration consists of `<!' followed by zero or more comments
> followed by `>'. Each comment starts with `--' and includes all text up
> to and including the next occurrence of `--'. In a comment declaration,
> white space is allowed after each comment, but not before the first
> comment. The entire comment declaration is ignored."
>
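
To make the contrast concrete, here is a rough sketch of what a
literal reading of the HTML 2.0 rule quoted above would look like.
This is only my interpretation of that paragraph, not the code in
HtDig 3.1.1; the function name is hypothetical, and it deliberately
handles only comment declarations (a DTD declaration is reported as
"not a comment declaration").

```cpp
#include <cassert>
#include <cctype>
#include <cstring>

// Hypothetical helper: strict reading of the quoted HTML 2.0 rule.
// Returns a pointer just past the closing '>' of a comment declaration,
// or nullptr if the text is not a well-formed comment declaration.
const char *skip_strict_comment_declaration(const char *p)
{
    assert(p != nullptr);
    if (std::strncmp(p, "<!", 2) != 0)
        return nullptr;
    p += 2;                                   // no white space allowed here

    // Zero or more comments: each starts with "--" and runs up to and
    // including the next "--"; white space may follow each comment.
    while (std::strncmp(p, "--", 2) == 0)
    {
        const char *close = std::strstr(p + 2, "--");
        if (!close)
            return nullptr;                   // comment opened, never closed
        p = close + 2;
        while (std::isspace(static_cast<unsigned char>(*p)))
            ++p;
    }
    return *p == '>' ? p + 1 : nullptr;       // declaration must end with '>'
}
```

Under this reading, "<!--comment---->" fails: the first pair of
dashes after "comment" closes the comment, the second pair opens a
new one, and that one is never closed. That is the kind of input that
makes a strict parser throw away the rest of the page.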
Matthew Edwards (medwards@go2net.com) | The fuel of innovation and
Go2Net Inc. 999 Third Ave Suite 4700 | progress is freedom.




This archive was generated by hypermail 2.0b3 on Thu Mar 04 1999 - 09:09:18 PST