htdig: StringMatch bug?

Majeau, Benoit (
Wed, 9 Dec 1998 16:35:20 -0500


Some files aren't being indexed because the StringMatch library is not
working properly (or rather, I guess findFirst is not working properly :).

Here are some simple facts:

1- Here's the portion of my log when the file WASN'T being indexed:

        Tag: /A>, matched 3
        href: ()
        Not added because: Item in the exclude list! (In fact the 38th
value, 2 large)*
        url rejected: (level
        Tag: IMG SRC="images/navw-bar.gif" WIDTH="2" HEIGHT="20"
ALIGN="BOTTOM" BORDER="0">, matched 18

        * Geoff, I modified Retriever::IsValidUrl() so that when a URL is
actually invalid, it prints out the reason. Could that be implemented in
the new version, since it takes just a few minutes to add? It is REALLY
useful when you want to have a good follow (...when debugging) of your

        But why is the length 2??? It should be 3, the length of the
matching string, because a link with "cgi-bin" (7 characters) gave me this
message:

        href: ()
        Not added because: Item in the exclude list! (In fact the 0th value,
7 large)

        Anyway, I knew it was the 38th string (counting from 0) in my
exclude list, "rct", that was matching the URL.
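
One way to cross-check the index and length printed in the log is a brute-force scan that does not go through StringMatch at all. The following is only a minimal sketch, not ht://Dig code: ExcludeHit and firstExcludeHit are hypothetical names, and it assumes that an exclude pattern matches when it occurs anywhere in the URL as a plain, case-sensitive substring.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Result of a brute-force scan over the exclude list.
struct ExcludeHit {
    bool found;          // true if some pattern occurs in the URL
    std::size_t index;   // 0-based position of that pattern in the list
    std::size_t length;  // length of that pattern in characters
};

// Hypothetical helper (an assumption, not part of ht://Dig): returns the
// first pattern, in list order, that occurs in the URL as a substring.
ExcludeHit firstExcludeHit(const std::string& url,
                           const std::vector<std::string>& patterns) {
    for (std::size_t i = 0; i < patterns.size(); ++i) {
        // Skip empty patterns; find() would report them at position 0.
        if (!patterns[i].empty() &&
            url.find(patterns[i]) != std::string::npos) {
            ExcludeHit hit = {true, i, patterns[i].size()};
            return hit;
        }
    }
    ExcludeHit none = {false, 0, 0};
    return none;
}
```

If this scan and the log disagree on the index or the length for the same URL, the comparison step is the likely culprit rather than the list itself.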

2- I had this exclude list:

exclude_urls: cgi-bin .cgi cwis irap /catalog_3d/ IRIX
irix hrb /infocisti/ cwis /test/ /temp/ /temp1/ /temp2/ /tempdir/ /zone/
ccbfc ptcbs cccme icsti /arctic /aic-journal acst /conferences/ ccsg /ctn
/dtf-gtn/ fptt /programs/ /nzdl/ /wusage/ /w-usage/ /catalog_int_ascii/
harvest gatherer broker rct stats /backup/ /testdir/ /confserv/ fox lynx

... on 1 single line.

As we can see, "rct" does not appear in my URL:

So I removed that string ("rct") from my exclude list, and the file was
indexed (WEIRD!).
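
To confirm which index a given string has in a list like the one above, the attribute value can be split and searched by hand. This is a rough sketch under the assumption that the value is split on whitespace; splitPatterns and indexOfPattern are hypothetical helpers, not functions from ht://Dig.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a whitespace-separated attribute value
// (such as the exclude_urls line) into individual patterns.
std::vector<std::string> splitPatterns(const std::string& value) {
    std::istringstream in(value);
    std::vector<std::string> patterns;
    std::string word;
    while (in >> word) {
        patterns.push_back(word);
    }
    return patterns;
}

// Hypothetical helper: 0-based index of the first pattern equal to
// `wanted`, or -1 if the list does not contain it.
int indexOfPattern(const std::vector<std::string>& patterns,
                   const std::string& wanted) {
    for (std::size_t i = 0; i < patterns.size(); ++i) {
        if (patterns[i] == wanted) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Calling indexOfPattern(splitPatterns(line), "rct") on the full exclude_urls value gives the position to compare against the one reported in the log.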

I tried to look at the findFirst function, but I'm not a big fan of binary
and masking stuff; I don't find it really intuitive ;). Anyway, too bad we
can't use the "=~" operator from Perl; it would have taken 5 lines instead
of 40 ;))).
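
For comparison, something close to the Perl "=~" approach can be sketched with std::regex, assuming a C++11 compiler is available. This is not how StringMatch works; escapeLiteral and isExcluded are hypothetical names, and the sketch assumes every exclude pattern is meant as a literal substring, so regex metacharacters such as the dot in ".cgi" are escaped first.

```cpp
#include <regex>
#include <string>
#include <vector>

// Escape characters that are special in ECMAScript regular expressions,
// so that a pattern such as ".cgi" is matched literally.
std::string escapeLiteral(const std::string& text) {
    static const std::string special = "\\^$.|?*+()[]{}";
    std::string out;
    for (char c : text) {
        if (special.find(c) != std::string::npos) {
            out += '\\';
        }
        out += c;
    }
    return out;
}

// Hypothetical helper: true if any non-empty pattern occurs in the URL.
bool isExcluded(const std::string& url,
                const std::vector<std::string>& patterns) {
    std::string alternation;
    for (const std::string& pattern : patterns) {
        if (pattern.empty()) {
            continue;  // an empty alternative would match every URL
        }
        if (!alternation.empty()) {
            alternation += '|';
        }
        alternation += escapeLiteral(pattern);
    }
    if (alternation.empty()) {
        return false;
    }
    // Equivalent in spirit to Perl's: $url =~ /pattern1|pattern2|.../
    return std::regex_search(url, std::regex(alternation));
}
```

The regex is rebuilt on every call here to keep the sketch short; compiling it once per exclude list would be the obvious refinement.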

Anyway, did I miss something? I would really like to get to the bottom of
all this.

P.S.: Btw, the exclude_urls list is loaded properly into the StringMatch
library, so the problem is obviously in how the string is compared against
the patterns =)
