htdig: Comments


Marjolein Katsma (HSH@taxon.demon.nl)
Wed, 06 Jan 1999 07:17:44 +0100


Starting on my next project, I had to dig into HTML.cc, and found the
following code to filter out comments:

// if (strncmp((char *)position, "<!--", 4) == 0)
// {
//     //
//     // Stupid comment. This can contain other '<' and '>'
//     // stuff which we have to ignore
//     //
//     q = (unsigned char*)strstr((char *)position, "-->");
//     if (!q)
//         return; // Rest of document is a comment...
//     position = q + 3;
//     continue;
// }

While this will catch *most* comments, it will see some perfectly legal
comments as illegal and skip the rest of the page. The best definition of
comments is found in HTML 2.0 (unchanged in the actual DTD in later
versions, but never properly explained any more...):

"To include comments in an HTML document, use a comment declaration. A
comment declaration consists of `<!' followed by zero or more comments
followed by `>'. Each comment starts with `--' and includes all text up to
and including the next occurrence of `--'. In a comment declaration, white
space is allowed after each comment, but not before the first comment. The
entire comment declaration is ignored."

Thus, the following are legal comment declarations:

<!--first comment
on two lines --

--second comment--
--third comment--
>

<!>

Both of these would be missed by htDig: in the case of the first, the rest
of the page would be considered a comment; the second would not be
recognized as a comment declaration at all.
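
To make the failure concrete, here is a small stand-alone sketch of the old
check. The function name oldSkipComment and its return convention are my own
assumptions for illustration; the real code works inline on `position` inside
HTML.cc. It returns a pointer just past the comment, the unchanged pointer
when nothing is recognized, or a null pointer where the old code would give up
and treat the rest of the document as a comment.

```cpp
#include <cstring>

// Hypothetical stand-alone version of the old "<!--" ... "-->" check.
// Not the actual ht://Dig function; written only to illustrate the problem.
const char *oldSkipComment(const char *position)
{
    if (std::strncmp(position, "<!--", 4) != 0)
        return position;                // not recognized as a comment at all

    // Look for the literal three-character sequence "-->"
    const char *q = std::strstr(position, "-->");
    if (!q)
        return nullptr;                 // old code: rest of document is a comment
    return q + 3;                       // just past the "-->"
}
```

With this sketch, "<!-- a -->rest" is skipped correctly, but a legal
declaration with white space before the closing '>' (such as the first
example above) yields a null pointer, and "<!>" comes back unchanged.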

I've dreamed up the following code to take care of this - totally untested
so far! - and I'd like to hear any hints on how to improve it (if necessary).

        if (strncmp((char *)position, "<!", 2) == 0)
        {
            //
            // Possible comment declaration (but could be a DTD declaration!).
            // A comment can contain other '<' and '>', so we have to
            // ignore complete comment declarations, and of course
            // DTD declarations as well.
            //
            position += 2; // Get past declaration start
            while (*position)
            {
                // Let's see if the declaration ends here
                if (*position == '>')
                {
                    position++;
                    break; // End of comment declaration
                }
                // Not the end of the declaration yet: see if an actual
                // comment starts right here. (Searching ahead with strstr
                // could match the "--" of a later, unrelated comment.)
                if (strncmp((char *)position, "--", 2) == 0)
                {   // Found start of comment - now find the end
                    q = (unsigned char*)strstr((char *)position + 2, "--");
                    if (!q)
                        return; // Rest of document seems to be a comment...
                    position = q + 2;
                }
                else
                {   // Not a comment declaration after all
                    // but possibly DTD: get to the end
                    q = (unsigned char*)strstr((char *)position, ">");
                    if (!q)
                        return; // Declaration is never closed
                    position = q + 1;
                    break; // End of (whatever) declaration
                }
                // Skip whitespace after an individual comment
                while (isspace(*position))
                    position++;
            }
            continue;
        }
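
For anyone who wants to try the idea outside HTML.cc, here is a minimal
stand-alone sketch of the same rule. The name skipDeclaration and its return
convention are assumptions of mine, not part of ht://Dig. It expects a pointer
to "<!" and returns a pointer just past the closing '>', or a null pointer if
the declaration is never closed. Unlike the fragment above, it only accepts a
comment that starts at the current position, so that a "--" further down the
page cannot be mistaken for part of a DOCTYPE declaration.

```cpp
#include <cctype>
#include <cstring>

// Hypothetical helper, not the actual ht://Dig code: skip one "<! ... >"
// declaration following the HTML 2.0 rule (zero or more "--...--" comments,
// white space allowed after each comment, then '>').
const char *skipDeclaration(const char *position)
{
    position += 2;                                  // get past "<!"
    while (*position)
    {
        if (*position == '>')
            return position + 1;                    // end of the declaration

        if (std::strncmp(position, "--", 2) == 0)
        {
            // A comment starts here: find the "--" that closes it
            const char *end = std::strstr(position + 2, "--");
            if (!end)
                return nullptr;                     // unterminated comment
            position = end + 2;
            // White space is allowed after each individual comment
            while (std::isspace(static_cast<unsigned char>(*position)))
                position++;
        }
        else
        {
            // Not a comment (for example a DOCTYPE): skip to the next '>'
            const char *end = std::strchr(position, '>');
            return end ? end + 1 : nullptr;
        }
    }
    return nullptr;                                 // input ended inside the declaration
}
```

Called on the two example declarations above, this sketch returns the position
right after the final '>' in both cases, and on "<!DOCTYPE html>" it stops at
the first '>' rather than running on into a later comment.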

Comments anyone? ;-)

Marjolein Katsma
Java Woman - http://javawoman.com
HomeSite Help - http://hshelp.com/



This archive was generated by hypermail 2.0b3 on Thu Jan 07 1999 - 07:52:39 PST