Re: [htdig] Re: htdig: Comments


Matt Edwards (medwards@go2net.com)
Tue, 2 Mar 1999 16:53:51 -0800 (PST)


On Tue, 2 Mar 1999, Matt Edwards wrote:
> HtDig 3.1.1 isn't parsing (slightly non-standard) comments correctly.
>
> Extra dashes in the comment can confuse the current parser into
> ignoring a lot of content. For example <!--comment----> is seen as
> an uncompleted comment beginning.
>
> It seems a lot of web content doesn't strictly adhere to the
> "standard" for comments, so we should be a little careful here.
>
> For example both IE and Netscape require "<!--" comments to end
> with a "-->" without whitespace between the "--" and the ">".
> Perhaps htDig would be better off doing the same.

In response, Gilles Detillieux wrote:
> Marjolein brought up this issue in January. The htdig code used
> to do what you're requesting, but she wanted it changed to adhere to
> the standard. I only helped her debug her code so it would do what she
> wanted it to, to allow (require) standard comments. She went on to give
> a few examples of what standard comments could be:
>
>> Thus, the following are legal comment declarations:
>>
>> <!--first comment
>> on two lines --
>>
>> --second comment--
>> --third comment--
>> >
>>

Except that both IE and Netscape treat the above as an unclosed
comment, so nobody can get away with writing this in the real
world.
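
To make the strict reading concrete, here is a minimal sketch of how I
understand the rule quoted above: a comment declaration is "<!", then
zero or more "--"..."--" comments separated by optional whitespace,
then ">". The function name is hypothetical and this is not htdig
code; it only illustrates why extra dashes derail a strict parser.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper, not part of htdig.  p must point at "<!".
 * Returns a pointer just past the closing ">" of a declaration that
 * consists only of comments, or NULL if a comment is left open or the
 * declaration is something else (e.g. a DOCTYPE, not handled here). */
const char *skip_comment_decl_strict(const char *p)
{
    assert(strncmp(p, "<!", 2) == 0);
    p += 2;                                /* step over "<!" */

    /* Each pass consumes one "--" ... "--" comment. */
    while (strncmp(p, "--", 2) == 0)
    {
        p = strstr(p + 2, "--");           /* find this comment's end */
        if (p == NULL)
            return NULL;                   /* comment never closed */
        p += 2;
        while (isspace((unsigned char)*p)) /* whitespace between comments */
            p++;
    }
    return (*p == '>') ? p + 1 : NULL;
}
```

Under this reading the three-comment example above is accepted, but
"<!--comment---->" is not: the first "--" after "comment" closes the
comment, the next "--" opens a new one, and the ">" then sits inside
that second, never-closed comment.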

However, my real issue was not with this behavior, but with the parser
getting confused by extra dashes in comments. Extra dashes may be
"non-standard", but both Netscape and IE allow them, and I've found
enough content with extra dashes to make me worry.

How about a compromise where whitespace is allowed between the final
"--" and the closing ">"?

i.e.

[modified snip from HTML.c]
      if (strncmp((char *)position, "<!", 2) == 0)
        {
          //
          // Possible comment declaration (but could be DTD declaration!)
          // A comment can contain other '<' and '>':
          // we have to ignore complete comment declarations,
          // but of course also DTD declarations.
          //
          position += 2; // Get past declaration start

          // Not the end of the declaration yet:
          // we'll try to find an actual comment
          if (strncmp((char *)position, "--", 2) == 0)
            {
              // loop to find the end of the comment
              // end will have "--" followed by optional whitespace
              // and a closing ">".
              while(*position)
                {
                  // First, find the "--" part
                  q = (unsigned char*)strstr((char *)position, "--");
                  if (!q)
                    {
                      // rest of document seems to be a comment...
                      *position = '\0';
                    }
                  else
                    {
                      // Second, ignore whitespace between -- and >
                      position = q;
                      q += 2;
                      while (isspace(*q))
                          q++;

                      // Third, look for the closing ">"
                      if (*q == '>')
                        {
                          // found the end of the comment.
                          position = q + 1;
                          break;
                        }
                      else
                        {
                          // this wasn't the end of the comment
                          // skip the first dash and try again.
                          position += 1;
                        }
                    }
                }
            }
          else
            {
              // Not a comment declaration after all
              // but possibly DTD: get to the end
              q = (unsigned char*)strstr((char *)position, ">");
              if (q)
                {
                  position = q + 1;
                  // End of (whatever) declaration
                }
              else
                {
                  *position = '\0'; // Rest of document is DTD?
                }
              
            }
          continue;
        }
[end snip]
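
For anyone who wants to try the rule outside of htdig, here is a
minimal standalone sketch of the same loop. The function name and
signature are my own invention for illustration (they are not in
HTML.c), and unlike the snippet above it returns NULL instead of
truncating the buffer when the comment is never closed.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper, not part of htdig.  p must point at "<!--".
 * Returns a pointer just past the comment's closing ">", or NULL if
 * the rest of the document is an unclosed comment.  The end of a
 * comment is "--", optional whitespace, then ">". */
const char *skip_comment_lenient(const char *p)
{
    assert(strncmp(p, "<!--", 4) == 0);
    p += 2;   /* step over "<!" only, as in the snippet above, so the
                 search starts at the opening "--" */

    while ((p = strstr(p, "--")) != NULL)
    {
        const char *q = p + 2;

        /* ignore whitespace between "--" and ">" */
        while (isspace((unsigned char)*q))
            q++;

        if (*q == '>')
            return q + 1;   /* found the end of the comment */

        p++;                /* not the end: skip one dash and retry */
    }
    return NULL;            /* no terminator anywhere */
}
```

With this rule "<!--comment-->", "<!--comment---->" and
"<!--comment-- >" all end at the first ">" that follows a "--", which
matches what I see the browsers doing with extra dashes.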

> At the time, i.e. in 3.1.0b4, htdig didn't handle these, and your code
> snippet won't either. I'm assuming she had a reason to want this change.
> My feeling is htdig should respect the standard, and any non-standard
> behaviour should be optional.

Good point. However, there is a real-world industry standard here and a
theoretical paper standard. Which behaviour would most people prefer out
of the box?
