[htdig3-dev] Regex


Geoff Hutchison (ghutchis@wso.williams.edu)
Tue, 4 May 1999 20:58:49 -0400


OK, I took some time today to fool around with the regex changes to htdig's
limits and excludes. Believe it or not, part of that was coding a minimal
Regex fuzzy algorithm (more on that in a second).

Has anyone else tried it? The consensus seemed to be that this is a "Good
Thing(c)" but I haven't heard any feedback. Personally, it seems slightly
broken.

Let me rephrase that. I admit to being *VERY* bad at using regexp, almost
always having to refer to various guides for anything beyond the least
complicated expression. (Yes, I've even taken classes on them. I can design
finite automata to parse arbitrarily complex regexp, but it doesn't seem to
work the other way around.)

Not only did it not seem to work on my original set of testing configs (it
wouldn't index anything beyond the start_url itself), but when I tried defining
actual regexps to test the feature, it behaved in *very* strange ways.

For example:
start_url: http://www.htdig.org/contents.html
http://dev.htdig.org/contents.html
limit_urls_to: *.htdig.org/*.html

This only indexed the start_urls.
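
One possible explanation, sketched in Python rather than in htdig's own C++:
the limit_urls_to value above reads like a shell-style glob, not a regular
expression. In a regex, "*" is a repetition operator and "." matches any
character, so "/*" means "zero or more slashes" rather than "any path". The
sketch assumes Python's re and fnmatch modules as stand-ins; htdig's regex
library may treat the leading "*" differently (POSIX basic regex, for
instance, takes a leading "*" literally), and REGEX_SPELLING is a
hypothetical rewrite, not a setting taken from htdig.

```python
import fnmatch
import re

PATTERN = "*.htdig.org/*.html"  # limit_urls_to value from the config above
URLS = [
    "http://www.htdig.org/contents.html",
    "http://dev.htdig.org/contents.html",
]

# Read as a shell-style glob, the pattern accepts both start URLs.
glob_hits = [u for u in URLS if fnmatch.fnmatchcase(u, PATTERN)]

# Read as a regex by Python's re module, the leading "*" has nothing to
# repeat, so the pattern does not even compile.
try:
    re.compile(PATTERN)
    compiles_as_regex = True
except re.error:
    compiles_as_regex = False

# Even with the leading "*" dropped, "/*" only repeats the slash, so ".html"
# would have to follow the slashes almost immediately. Neither URL matches.
stripped_hits = [u for u in URLS if re.search(".htdig.org/*.html", u)]

# Hypothetical regex spelling of what the glob was presumably meant to say.
REGEX_SPELLING = r".*\.htdig\.org/.*\.html"
regex_hits = [u for u in URLS if re.search(REGEX_SPELLING, u)]

print(glob_hits, compiles_as_regex, stripped_hits, regex_hits)
```

If htdig's matcher behaves anything like this, no URL would pass the limit,
which would be consistent with only the start_urls being indexed.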

Anyway, I thought it might be easier for me to debug using a regex fuzzy
algorithm. So I wrote one along the lines of the substring algorithm. I
have a feeling it will be popular if it works as promised. But here's an
example:

Enter value for words: gdbm*
tempWords: 'gdbm*:0 '
Boolean: 'gdbm*:0 '
initial: ''
Fuzzy on: gdbm*
   exact gdbm*
   regex dbwordsgdbm gdbm gdbmdb ggdb htlibgdbmdbc htlibgdbmdbh lgdbm

I don't know about you, but I smell a bug when "ggdb" is matched for
"gdbm*". I'm clearly missing something. Is it my regex incompetence?
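
A minimal sketch of one reading that would account for this output, written
in Python rather than htdig's C++: as a regular expression, "gdbm*" means
"gdb" followed by zero or more "m", and an unanchored search accepts any word
that merely contains "gdb". Whether the fuzzy code really searches unanchored
is an assumption here; it just happens to reproduce the list above. The word
list is copied from the output, and the variable names are made up for
illustration.

```python
import fnmatch
import re

# Words reported by the regex fuzzy algorithm in the output above.
WORDS = [
    "dbwordsgdbm", "gdbm", "gdbmdb", "ggdb",
    "htlibgdbmdbc", "htlibgdbmdbh", "lgdbm",
]
PATTERN = "gdbm*"

# Unanchored regex search: "m*" may match zero characters, so any word
# containing "gdb" qualifies, including "ggdb".
unanchored = [w for w in WORDS if re.search(PATTERN, w)]

# Shell-style glob reading: "gdbm" followed by anything.
as_glob = [w for w in WORDS if fnmatch.fnmatchcase(w, PATTERN)]

# The regex spelling of that prefix intent; re.match anchors at the start.
as_prefix_regex = [w for w in WORDS if re.match(r"gdbm.*", w)]

print(unanchored)
print(as_glob)
print(as_prefix_regex)
```

Under that reading "ggdb" is a legitimate regex match, and getting prefix
behaviour would need an anchored pattern along the lines of "^gdbm.*".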

-Geoff
