BOUNCE htdig: Admin request


owner-htdig@sdsu.edu
Thu, 7 Jan 1999 11:55:09 -0800 (PST)


>From andrew@contigo.com Thu Jan 7 11:55:07 1999
Received: from spartacus (spartacus.a2000.nl [62.108.1.20])
        by sdsu.edu (8.8.7/8.8.7) with ESMTP id LAA12579
        for <htdig@sdsu.edu>; Thu, 7 Jan 1999 11:54:59 -0800 (PST)
Received: from node149c.a2000.nl ([62.108.20.156] helo=albert)
        by spartacus with smtp (Exim 2.02 #4)
        id 0zyLVc-0001W2-00; Thu, 7 Jan 1999 20:54:20 +0100
Message-Id: <4.1.19990107204436.040d24f0@pop3.demon.nl>
X-Sender: taxon@pop3.demon.nl
X-Mailer: QUALCOMM Windows Eudora Pro Version 4.1
Date: Thu, 07 Jan 1999 20:54:17 +0100
To: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
From: Marjolein Katsma <HSH@taxon.demon.nl>
Subject: Re: htdig: Comments
Cc: htdig@sdsu.edu
In-Reply-To: <199901062151.PAA19312@cliff.scrc.umanitoba.ca>
References: <4.1.19990106070428.00a84f10@pop3.demon.nl>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"

Gilles,

Thanks for the help. I had a feeling *something* was wrong, just didn't see
it any more (too many late nights...)

At 15:51 1999-01-06 -0600, Gilles Detillieux wrote:
>According to Marjolein Katsma:
[snip]
>
>OK, I've found a couple problems. First of all, you shouldn't use
>strstr to find the start of the comment, as you could skip over a DTD
>and everything up to the start of the next comment. As the first comment
>should begin right away, and subsequent comments can be separated only by
>white space (which you skip after the end of the comment), you don't need
>to search at all. It should be right there. Secondly, there's a potential
>infinite loop situation, if it doesn't find any "--" or ">" after the "<!".

Your changes worked. I've created a couple of test files,
ctest1.html-ctest6.html, which can be found at:
        http://javawoman.com/tests/ctest.zip

Test results (completely new databases created for each):

-- old program --

javawoman: {9} % ./rundig_ctest
htdig: Run complete
htdig: 1 server seen:
htdig: javawoman.com:80 6 documents
htmerge: Total word count: 6
htmerge: Total documents: 6
htmerge: Total doc db size (in K): 1

-- db_wordlist from old program --

comment l:256 i:5 w:74400
comment l:285 i:3 w:71500
comment l:293 i:1 w:70700
comment l:357 i:0 w:64300
fout l:446 i:1 w:554 //after illegal comment (text between comments)
goochelaar l:435 i:3 w:565 //OK
magician l:535 i:0 w:465 //OK
prestidigitateu l:602 i:0 w:398 //OK
test l:282 i:5 w:71800
test l:314 i:3 w:68600
test l:322 i:1 w:67800
test l:392 i:0 w:60800
zauberer l:570 i:5 w:430 //after extra DTD
                                    
                                    
                                    
-- new program --

javawoman: {10} % ./rundig_ctest
htdig: Run complete
htdig: 1 server seen:
htdig: javawoman.com:80 6 documents
htmerge: Total word count: 8
htmerge: Total documents: 6
htmerge: Total doc db size (in K): 1

-- db_wordlist from new program --

comment l:111 i:0 w:88900
comment l:80 i:5 w:92000
comment l:89 i:3 w:91100
comment l:89 i:4 w:91100
comment l:91 i:1 w:90900
comment l:93 i:2 w:90700
fout l:245 i:1 w:755 //(!) after illegal comment (text between comments)
goochelaar l:239 i:3 w:761 //OK
magician l:290 i:0 w:710 //OK
prestidigitateu l:343 i:0 w:657 //OK
test l:105 i:5 w:89500
test l:117 i:3 w:88300
test l:118 i:4 w:88200
test l:120 i:1 w:88000
test l:124 i:2 w:87600
test l:147 i:0 w:85300
tovenaar l:240 i:4 w:760 //OK (!) was skipped in old version...
wrong l:251 i:2 w:749 //(!) after illegal comment (text after comment)
zauberer l:217 i:5 w:783 //after extra DTD

-------------------------------

Note that in the original program:
1) the document with text after a comment and before > is not indexed at all:
   'comment' and 'test' occur in the title but don't appear for ctest3.html (i:2)
2) the document with white space after the last comment and before > is not
   indexed at all (but should be):
   'comment' and 'test' occur in the title but don't appear for ctest5.html (i:4)
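
To show why a document like ctest5.html drops out with the old code, here is
a minimal sketch of the original check lifted into a standalone function. The
name old_skip_comment and the nullptr return convention are made up for
illustration; HTML::parse() itself simply returns when no "-->" is found.

```cpp
#include <cstring>

// Hypothetical helper mirroring the original test in HTML::parse().
// Returns the position just past "-->", or nullptr when no "-->" exists
// (the point where the original code gives up on the rest of the document).
// Input that does not start with "<!--" is returned unchanged.
static const char *old_skip_comment(const char *position)
{
    if (std::strncmp(position, "<!--", 4) != 0)
        return position;
    // Only the exact sequence "-->" counts as the end of the comment,
    // so "-- >" (white space before '>') is never recognised.
    const char *q = std::strstr(position, "-->");
    return q ? q + 3 : nullptr;
}
```

With "<!-- comment -- >" at the top of a file and no later "-->", this returns
nullptr, i.e. the title and body are never reached. If a later comment does
exist, everything up to its "-->" would be skipped instead.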

The new algorithm not only treats the comment with whitespace correctly, it's
also 'tolerant' of illegal comment declarations, as long as they start with
'<!--' and end with '>'. This could be made stricter, but I like it this way ;-)
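
For anyone who wants to try the logic outside htdig, the loop can be sketched
as a standalone function. This is only an approximation of what the patch to
HTML::parse() does: the name skip_declaration and the nullptr return (standing
in for the plain "return" in the patch) are assumptions for illustration.

```cpp
#include <cctype>
#include <cstring>

// Hypothetical standalone version of the declaration-skipping loop.
// Returns the position just past the declaration, or nullptr if the
// declaration never ends. Input not starting with "<!" is returned unchanged.
static const char *skip_declaration(const char *position)
{
    if (std::strncmp(position, "<!", 2) != 0)
        return position;
    position += 2;                          // get past "<!"
    while (*position)
    {
        if (*position == '>')
            return position + 1;            // end of declaration
        if (std::strncmp(position, "--", 2) == 0)
        {
            // Start of a comment: find the "--" that closes it
            const char *q = std::strstr(position + 2, "--");
            if (!q)
                return nullptr;             // rest of document is a comment
            position = q + 2;
        }
        else
        {
            // DTD or illegal comment: run to the next '>'
            const char *q = std::strchr(position, '>');
            return q ? q + 1 : nullptr;
        }
        // White space may follow an individual comment
        while (std::isspace(static_cast<unsigned char>(*position)))
            position++;
    }
    return position;                        // reached end of input
}
```

Fed "<!-- a --  >x", "<!-- a -- wrong>x" or "<!DOCTYPE html>x", the sketch
returns a pointer to "x" in each case, which matches the tolerant behaviour
described above.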

Here's a patch, diff with distribution version 3.1.0b4:

*** HTML.cc.orig Tue Dec 22 18:53:12 1998
--- HTML.cc Thu Jan 7 12:37:05 1999
*************** HTML::parse(Retriever &retriever, URL &b
*** 181,198 ****

      while (*position)
      {
! if (strncmp((char *)position, "<!--", 4) == 0)
        {
            //
! // Stupid comment. This can contain other '<' and '>'
! // stuff which we have to ignore
            //
! q = (unsigned char*)strstr((char *)position, "-->");
! if (!q)
! return; // Rest of document is a comment...
! position = q + 3;
            continue;
        }
        if (*position == '<')
        {
            //
--- 181,250 ----

      while (*position)
      {
!
! // if (strncmp((char *)position, "<!--", 4) == 0)
! // {
! // //
! // // Stupid comment. This can contain other '<' and '>'
! // // stuff which we have to ignore
! // //
! // q = (unsigned char*)strstr((char *)position, "-->");
! // if (!q)
! // return; // Rest of document is a comment...
! // position = q + 3;
! // continue;
! // }
!
! // Improved algorithm 1999-01-07 Marjolein Katsma
! // (with help from Gilles Detillieux)
! if (strncmp((char *)position, "<!", 2) == 0)
        {
            //
! // Possible comment declaration (but could be DTD declaration!).
! // A comment can contain other '<' and '>':
! // we have to ignore complete comment declarations
! // but of course also DTD declarations.
            //
! position += 2; // Get past declaration start
! while (*position)
! {
! // Let's see if the declaration ends here
! if (*position == '>')
! {
! position++;
! break; // End of comment declaration
! }
! // Not the end of the declaration yet:
! // we'll see if it is an actual comment (should start right here)
! if (strncmp((char *)position, "--", 2) == 0)
! { // Start of comment - now find the end
! position += 2;
! q = (unsigned char*)strstr((char *)position, "--");
! if (!q)
! return; // Rest of document seems to be a comment...
! position = q + 2;
! }
! else
! { // Not a (legal) comment declaration after all;
! // could be illegal comment or DTD:
! // get to the end
! q = (unsigned char*)strstr((char *)position, ">");
! if (q)
! {
! position = q + 1;
! break; // End of (whatever) declaration
! }
! else
! return; // Rest of document is DTD?
! }
! // Skip whitespace after an individual comment
! while (isspace(*position))
! position++;
! }
            continue;
        }
+
+
        if (*position == '<')
        {
            //

>
>--
>Gilles R. Detillieux E-mail: <grdetil@scrc.umanitoba.ca>
>Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
>Dept. Physiology, U. of Manitoba Phone: (204)789-3766
>Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930

Cheers,

Marjolein Katsma
Java Woman - http://javawoman.com
HomeSite Help - http://hshelp.com/



This archive was generated by hypermail 2.0b3 on Sun Jan 10 1999 - 16:36:29 PST