htdig: StringMatch bug?


Majeau, Benoit (Benoit.Majeau@nrc.ca)
Wed, 9 Dec 1998 16:35:20 -0500


Hi

Some files aren't being indexed because the StringMatch library is not
working properly (or rather, I suspect findFirst is not working properly :).

Here are some simple facts:

1- Here's the portion of my log when the file WASN'T being indexed:

        [...]
        image: http://www.nrc.ca/corporate/english/images/navw-nrc.gif
        Tag: /A>, matched 3
        href: http://www.nrc.ca/corporate/english/tools/institut.html ()
        Not added because: Item in the exclude list! (In fact the 38th value, 2 large)*
        url rejected: (level 1)http://www.nrc.ca/corporate/english/tools/institut.html
        Tag: IMG SRC="images/navw-bar.gif" WIDTH="2" HEIGHT="20" ALIGN="BOTTOM" BORDER="0">, matched 18
        [...]

        * Geoff, I modified Retriever::IsValidUrl() so that when a URL is
actually invalid, it prints out the reason. Could that be included in the
new version, since it takes just a few minutes to add? It is REALLY useful
when you want to follow your dig closely (...when debugging).
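
A minimal sketch of the kind of reason reporting described above, not the
actual patch: UrlVerdict and checkUrl are hypothetical names, and the match
is done with plain std::string::find rather than StringMatch. The message
format simply copies the log lines quoted in this post.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: whether the URL is accepted and, if not, why.
struct UrlVerdict {
    bool accepted;
    std::string reason;  // empty when the URL is accepted
};

// Hypothetical stand-in for a validity check that reports its reason.
// Assumption: an exclude entry rejects a URL when it occurs anywhere in it.
UrlVerdict checkUrl(const std::string& url,
                    const std::vector<std::string>& excludes) {
    for (std::size_t i = 0; i < excludes.size(); ++i) {
        const std::string& pattern = excludes[i];
        if (!pattern.empty() && url.find(pattern) != std::string::npos) {
            // Report which entry matched (counted from 0) and its length.
            return {false,
                    "Item in the exclude list! (In fact the " +
                        std::to_string(i) + "th value, " +
                        std::to_string(pattern.size()) + " large)"};
        }
    }
    return {true, ""};
}
```

With a check shaped like this, the reported length is always the length of
the entry that matched, which is what makes a reported "2 large" for a
3-character entry stand out.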

        But why is the length 2??? It should be 3, because for a link with
"cgi-bin" (7 characters) I got this message:

        href: http://www.nrc.ca/cgi-bin/corporate/external.pl ()
        Not added because: Item in the exclude list! (In fact the 0th value, 7 large)

        Anyway, I knew it was the 38th string (counting from 0) in my
exclude list, "rct", that was matching the URL.

2- I had this exclude list:

exclude_urls: cgi-bin .cgi cwis ctn.nrc.ca rct.nrc.ca irap /catalog_3d/ IRIX
irix hrb /infocisti/ cwis /test/ /temp/ /temp1/ /temp2/ /tempdir/ /zone/
ccbfc ptcbs cccme icsti /arctic /aic-journal acst /conferences/ ccsg /ctn
/dtf-gtn/ fptt /programs/ /nzdl/ /wusage/ /w-usage/ /catalog_int_ascii/
harvest gatherer broker rct stats /backup/ /testdir/ /confserv/ fox lynx

... all on a single line.

As we can see, "rct" does not appear anywhere in my URL:
http://www.nrc.ca/corporate/english/tools/institut.html

So I removed that string ("rct") from my exclude list, and the file was
indexed (WEIRD!).
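
The observation that no exclude entry occurs in the rejected URL can be
cross-checked without StringMatch at all. The sketch below is only an
assumption-laden sanity check: splitPatterns and firstExcluded are
hypothetical helpers, the list is the exclude_urls value quoted above, and
the comparison is a plain case-sensitive substring search.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// The exclude_urls value quoted in this post, as one whitespace-separated line.
const char* const kExcludeUrls =
    "cgi-bin .cgi cwis ctn.nrc.ca rct.nrc.ca irap /catalog_3d/ IRIX "
    "irix hrb /infocisti/ cwis /test/ /temp/ /temp1/ /temp2/ /tempdir/ /zone/ "
    "ccbfc ptcbs cccme icsti /arctic /aic-journal acst /conferences/ ccsg /ctn "
    "/dtf-gtn/ fptt /programs/ /nzdl/ /wusage/ /w-usage/ /catalog_int_ascii/ "
    "harvest gatherer broker rct stats /backup/ /testdir/ /confserv/ fox lynx";

// Split a whitespace-separated attribute value into individual patterns.
std::vector<std::string> splitPatterns(const std::string& value) {
    std::istringstream in(value);
    std::vector<std::string> patterns;
    std::string word;
    while (in >> word) patterns.push_back(word);
    return patterns;
}

// Index (counted from 0) of the first pattern that occurs in url, or -1.
int firstExcluded(const std::string& url,
                  const std::vector<std::string>& patterns) {
    for (std::size_t i = 0; i < patterns.size(); ++i) {
        if (url.find(patterns[i]) != std::string::npos) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Under these assumptions, firstExcluded returns -1 for
http://www.nrc.ca/corporate/english/tools/institut.html and 0 for
http://www.nrc.ca/cgi-bin/corporate/external.pl, and entry 38 of the split
list is "rct".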

I tried to look at the findFirst function, but I'm not a big fan of binary
and masking stuff; I don't find it really intuitive ;). Anyway, too bad we
can't use the "=~" operator from Perl; it would have taken 5 lines instead
of 40 ;))).

Anyway, did I miss something? I would really like to know the root cause
of all this.

P.S.: Btw, exclude_urls is loaded properly into the StringMatch library,
so the problem is obviously in comparing the string against the patterns =)



