htdig: Comments (2)


Marjolein Katsma (webmaster@javawoman.com)
Tue, 12 Jan 1999 06:59:29 +0100


The following is a patch to the original algorithm for skipping comments
before doing any further parsing of an HTML file. The original algorithm
fails to recognize (legal) comment declarations that have whitespace after
the (final) comment in the declaration, i.e. between the closing '--' and
the '>', and ends up not indexing the whole document if such a declaration
is found.
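
To make the failure concrete, here is a minimal standalone sketch of the
pre-patch check. It is not the htdig code itself: old_skip_comment is a
hypothetical name, and it works on a plain const char* rather than on the
parser's own buffer. It only assumes what the old hunk in the diff shows,
namely that a comment is looked for as "<!--" up to the next "-->".

```cpp
#include <cassert>
#include <cstring>

// Illustration only: mimics the pre-patch test, which recognises a comment
// solely as "<!--" ... "-->".
// Returns the position just past the comment, the unchanged position when
// the text does not start with "<!--", or nullptr when no "-->" follows
// (the case in which the original parser gave up on the document).
const char *old_skip_comment(const char *position)
{
    assert(position != nullptr);
    if (std::strncmp(position, "<!--", 4) != 0)
        return position;                  // not a comment: nothing to skip
    const char *q = std::strstr(position, "-->");
    if (!q)
        return nullptr;                   // "rest of document is a comment"
    return q + 3;                         // resume right after "-->"
}
```

With "<!-- a -- >text" this returns nullptr, because "-- >" never matches
"-->". If a later comment does end in "-->", everything up to that point is
swallowed instead.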

I've made a small correction to my original code: when a comment
declaration doesn't seem to have a closing '>', the remainder of the
document is simply skipped, and whatever came before it is still indexed.
'Illegal' comment declarations that do have one are skipped up to that
final '>'.
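
For anyone who wants to try the logic outside htdig, here is a rough
standalone sketch of the same skipping rules. It is an illustration under
assumptions, not the patch: skip_declaration is a hypothetical name, the
input is a const char*, and where the patch writes '\0' into the buffer to
drop the rest of the document, the sketch returns a pointer to the end of
the string instead.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>

// Illustration only: skips one "<!...>" declaration and returns the
// position where parsing would resume. Returns the unchanged position if
// the text does not start with "<!", and a pointer to the terminating '\0'
// when the declaration never ends.
const char *skip_declaration(const char *position)
{
    assert(position != nullptr);
    if (std::strncmp(position, "<!", 2) != 0)
        return position;                      // not a declaration
    position += 2;                            // get past declaration start
    while (*position)
    {
        if (*position == '>')
            return position + 1;              // end of the declaration
        if (std::strncmp(position, "--", 2) == 0)
        {
            // Start of a comment: find the "--" that closes it
            const char *q = std::strstr(position + 2, "--");
            if (!q)
                return position + std::strlen(position);  // unterminated
            position = q + 2;
        }
        else
        {
            // Not a (legal) comment: illegal comment or DTD, go to next '>'
            const char *q = std::strchr(position, '>');
            return q ? q + 1 : position + std::strlen(position);
        }
        // Skip whitespace after an individual comment
        while (std::isspace(static_cast<unsigned char>(*position)))
            ++position;
    }
    return position;                          // ran off the end of the text
}
```

Here "<!-- a -- >text" and "<!-- a -- -- b -->text" both resume at "text",
"<!DOCTYPE html>rest" resumes at "rest", and "<!-- never closed" resumes at
the end of the string.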

This patch is a *replacement* for my original one (which was mailed but
somehow never made it to the list); it is a diff against HTML.cc from the
original 3.1.0b4 release:

diff -3p HTML.cc.orig HTMLcommentMK.cc
*** HTML.cc.orig Tue Dec 22 18:53:12 1998
--- HTMLcommentMK.cc Mon Jan 11 22:46:49 1999
***************
*** 3,8 ****
--- 3,15 ----
  //
  // Implementation of HTML
  //
+ // Revision 1999-01-07/1999-01-09 mkatsma
+ // Modification of comment-filtering algorithm so it skips all legal SGML
+ // comment declarations, including ones with whitespace after the last
+ // comment in the declaration. Illegal comment declarations are skipped
+ // till the next '>' without preventing the rest of the document being
+ // indexed.
+ //
  // $Log: HTML.cc,v $
  // Revision 1.23 1998/12/12 01:48:52 ghutchis
  // Fix coredump when META refresh tags don't have content portions (e.g. no URL).
*************** HTML::parse(Retriever &retriever, URL &b
*** 181,198 ****

      while (*position)
      {
!       if (strncmp((char *)position, "<!--", 4) == 0)
        {
            //
!           // Stupid comment. This can contain other '<' and '>'
!           // stuff which we have to ignore
            //
!           q = (unsigned char*)strstr((char *)position, "-->");
!           if (!q)
!               return; // Rest of document is a comment...
!           position = q + 3;
            continue;
        }
        if (*position == '<')
        {
            //
--- 188,251 ----

      while (*position)
      {
!
!       // Improved algorithm 1999-01-07 Marjolein Katsma
!       // (with help from Gilles Detillieux)
!       // Small fix 1999-01-09
!       if (strncmp((char *)position, "<!", 2) == 0)
        {
            //
!           // Possible comment declaration (but could be DTD declaration!).
!           // A comment can contain other '<' and '>':
!           // we have to ignore complete comment declarations
!           // but of course also DTD declarations.
            //
!           position += 2;  // Get past declaration start
!           while (*position)
!           {
!               // Let's see if the declaration ends here
!               if (*position == '>')
!               {
!                   position++;
!                   break;  // End of comment declaration
!               }
!               // Not the end of the declaration yet:
!               // we'll see if it is an actual comment (should start right here)
!               if (strncmp((char *)position, "--", 2) == 0)
!               {   // Start of comment - now find the end
!                   position += 2;
!                   q = (unsigned char*)strstr((char *)position, "--");
!                   if (!q)
!                   {
!                       *position = '\0';  // Rest of document (comment?) will be skipped
!                       break;
!                   }
!                   position = q + 2;
!               }
!               else
!               {   // Not a (legal) comment declaration after all;
!                   // could be illegal comment or DTD:
!                   // get to the end
!                   q = (unsigned char*)strstr((char *)position, ">");
!                   if (q)
!                   {
!                       position = q + 1;
!                       break;  // End of (whatever) declaration
!                   }
!                   else
!                   {
!                       *position = '\0';  // Rest of document (DTD?) will be skipped
!                       break;
!                   }
!               }
!               // Skip whitespace after an individual comment
!               while (isspace(*position))
!                   position++;
!           }
            continue;
        }
+
+
        if (*position == '<')
        {
            //

Marjolein Katsma webmaster@javawoman.com
Java Woman - http://javawoman.com/


