Re: htdig: Comments


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 6 Jan 1999 15:51:13 -0600 (CST)


According to Marjolein Katsma:
> Starting on my next project, I had to dig in HTML.cc, and found the
> following code to filter out comments:
[snip]
> While this will catch *most* comments, it will see some perfectly legal
> comments as illegal and skip the rest of the page. The best definition of
> comments is found in HTML 2.0 (unchanged in the actual DTD in later
> versions, but never properly explained any more...):
>
> "To include comments in an HTML document, use a comment declaration. A
> comment declaration consists of `<!' followed by zero or more comments
> followed by `>'. Each comment starts with `--' and includes all text up to
> and including the next occurrence of `--'. In a comment declaration, white
> space is allowed after each comment, but not before the first comment. The
> entire comment declaration is ignored."
>
> Thus, the following are legal comment declarations:
>
> <!--first comment
> on two lines --
>
> --second comment--
> --third comment--
> >
>
> <!>
>
> Both of these would be missed by htDig; in the case of the first, the rest
> of the page would be considered a comment, and the second would not be
> recognized as a comment declaration.
>
> I've dreamed up the following code to take care of this - totally untested
> so far! - I'd like to hear any hints how to improve it (if necessary).

OK, I've found a couple of problems. First, you shouldn't use strstr to
find the start of a comment, as it could skip over a DTD declaration and
everything else up to the start of the next comment. Since the first
comment must begin immediately after the "<!", and subsequent comments can
be separated only by white space (which you already skip after the end of
each comment), you don't need to search at all: the "--" should be right
there. Second, there's a potential infinite loop if neither "--" nor ">"
is found after the "<!", since nothing in the loop then moves position to
the end of the document.
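
To illustrate the first point, here is a small sketch contrasting the two
tests. The helper names are hypothetical and are not part of HTML.cc; both
take a pointer to the text just after the "<!".

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper (not from HTML.cc): true only if a comment starts
// exactly at p, which is all the loop needs to know.
static bool comment_starts_here(const char *p)
{
    assert(p != nullptr);
    return std::strncmp(p, "--", 2) == 0;
}

// Hypothetical helper: what a forward search does instead. Returns the
// offset of the next "--" anywhere later in the buffer, or -1 if none.
static long next_dashes_offset(const char *p)
{
    assert(p != nullptr);
    const char *q = std::strstr(p, "--");
    return q ? static_cast<long>(q - p) : -1;
}
```

Given the text following the "<!" of a DOCTYPE line, comment_starts_here
says "no comment here", while the strstr search lands on the "--" of some
later, unrelated comment, well past the ">" that really ends the
declaration.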

>
>           if (strncmp((char *)position, "<!", 2) == 0)
>           {
>               //
>               // Possible comment declaration (but could be DTD declaration!).
>               // A comment can contain other '<' and '>':
>               // we have to ignore a complete comment declaration
>               // but of course also DTD declarations.
>               //
>               position += 2; // Get past declaration start
>               while (*position)
>               {
>                   // Let's see if the declaration ends here
>                   if (*position == '>')
>                   {
>                       position++;
>                       break; // End of comment declaration
>                   }
>                   // Not the end of the declaration yet:
>                   // we'll try to find an actual comment
>                   q = (unsigned char*)strstr((char *)position, "--");
>                   if (q)
>                   { // Found start of comment - now find the end
>                       position = q + 2;

Replace the above 4 lines with:

                    if (strncmp((char *)position, "--", 2) == 0)
                    { // Found start of comment - now find the end
                        position += 2;

>                       q = (unsigned char*)strstr((char *)position, "--");
>                       if (!q)
>                           return; // Rest of document seems to be a comment...
>                       position = q + 2;
>                   }
>                   else
>                   { // Not a comment declaration after all
>                       // but possibly DTD: get to the end
>                       q = (unsigned char*)strstr((char *)position, ">");
>                       if (q)
>                       {
>                           position = q + 1;
>                           break; // End of (whatever) declaration
>                       }

Add this code here:

                            else
                                return; // Rest of document is DTD?

>                   }
>                   // Skip whitespace after an individual comment
>                   while (isspace(*position))
>                       position++;
>               }
>               continue;
>           }
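
For reference, here is a rough standalone sketch of the loop with both
changes applied. The function name skip_declaration and its return
convention are assumptions for illustration only: the code in HTML.cc
works on position in place and uses return to stop parsing, whereas this
version returns a pointer to the terminating '\0' when the declaration
never ends.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>

// Hypothetical standalone version of the loop (not the actual HTML.cc code).
// `position` must point at "<!". Returns a pointer just past the
// declaration, or to the terminating '\0' if the declaration never ends.
const unsigned char *skip_declaration(const unsigned char *position)
{
    assert(std::strncmp(reinterpret_cast<const char *>(position), "<!", 2) == 0);
    position += 2;                      // Get past declaration start
    while (*position)
    {
        const char *p = reinterpret_cast<const char *>(position);
        if (*position == '>')
            return position + 1;        // End of declaration
        if (std::strncmp(p, "--", 2) == 0)
        {
            // A comment starts right here - find where it ends
            const char *q = std::strstr(p + 2, "--");
            if (!q)
                return position + std::strlen(p);   // Unterminated comment
            position = reinterpret_cast<const unsigned char *>(q) + 2;
        }
        else
        {
            // Not a comment, possibly a DTD declaration: get to the end
            const char *q = std::strchr(p, '>');
            if (!q)
                return position + std::strlen(p);   // No '>' at all
            return reinterpret_cast<const unsigned char *>(q) + 1;
        }
        // Skip whitespace after an individual comment
        while (std::isspace(*position))
            position++;
    }
    return position;
}
```

Every path through the loop either returns or moves position forward, so
it cannot spin forever on a declaration that has no "--" or ">" after the
"<!".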

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


