htdig: logging, phrase searching, and prefix matching


Esa Ahola (esa@cyclone.mindspring.com)
Mon, 15 Dec 1997 15:11:08 -0500 (EST)


On Mon, 15 Dec 1997, Richard Bingle wrote:

> The first is search term logging. Basically, this would be something
> that would keep track of the search terms submitted to htsearch.

I've used two approaches:

- Using GET instead of POST will have the server log the search terms (and
  other htsearch parameters sent from the form) in the web server logs,
  at the expense of some log and displayed-URL clutter.

- I have wrapped htsearch inside a Perl 'nph' cgi to enforce a limit on
  simultaneous searches per client and to display in-progress and
  other apologetic messages (big database, slow machine, sigh.) Such a
  script could easily do additional logging and usage tracking.
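
As a rough illustration of the first approach, the sketch below pulls the
search string back out of an access log line once the form uses GET. It
assumes a Common Log Format line and that the search box is submitted as a
parameter named "words"; the sample line, the path, and the function name
are made up for the example.

```python
from urllib.parse import urlsplit, parse_qs

def search_terms_from_log_line(line, param="words"):
    """Return the search string from one access-log line, or None.

    Assumes Common Log Format, where the request is the first
    double-quoted field: "GET /path?query HTTP/1.0".
    The parameter name "words" is an assumption about the search form.
    """
    try:
        request = line.split('"')[1]
        method, target, _protocol = request.split(" ", 2)
    except (IndexError, ValueError):
        return None
    if method != "GET":
        # POST bodies never reach the access log, so nothing to recover.
        return None
    query = parse_qs(urlsplit(target).query)
    values = query.get(param)
    return values[0] if values else None

# Hypothetical log line, for illustration only.
sample = ('10.0.0.1 - - [15/Dec/1997:15:11:08 -0500] '
          '"GET /cgi-bin/htsearch?config=htdig&words=mark+smith HTTP/1.0" 200 5120')
print(search_terms_from_log_line(sample))
```

A wrapper script, as in the second approach, could write the same string
to its own log file before handing the request on to htsearch.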

> The second is phrase searching. I know I can search for (mark and
> smith), but that finds too many matches when what I really want is
> "mark smith".

That's doable, but expensive compared to scoring directly from the word
index. If it were done, one might as well implement a "near" operator that
scores according to distance. Hmmm...
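
To make the idea concrete, here is a minimal sketch of distance-based
scoring. It assumes an index that records, per document, the sorted word
positions of each term, which is an assumption of the example and not a
description of how htdig stores words. The function names, the window size
and the 1/distance score are all illustrative choices.

```python
def min_distance(positions_a, positions_b):
    """Smallest gap between any position in a and any in b (both sorted)."""
    i = j = 0
    best = None
    while i < len(positions_a) and j < len(positions_b):
        gap = abs(positions_a[i] - positions_b[j])
        if best is None or gap < best:
            best = gap
        # Advance whichever list is behind, as in a merge.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best

def near_score(positions_a, positions_b, window=10):
    """Score 1/distance when the terms fall within `window` words, else 0."""
    d = min_distance(positions_a, positions_b)
    if d is None or d > window:
        return 0.0
    return 1.0 / max(d, 1)

def is_phrase(positions_a, positions_b):
    """True when some occurrence of b directly follows an occurrence of a."""
    following = set(positions_b)
    return any(p + 1 in following for p in positions_a)

# Hypothetical positions of "mark" and "smith" in one document.
mark, smith = [4, 40], [5, 90]
print(near_score(mark, smith), is_phrase(mark, smith))
```

The extra cost mentioned above shows up here as the per-document position
lists that have to be stored and walked, instead of one score per word.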

While on the topic of enhancements, I have implemented my "favorite"
missing htdig feature, namely a prefix matching fuzzy algorithm. It uses
a trailing asterisk in the search term to specify which words should be
prefix matched. For example,

        foo bar* foobar

would match "foo" and "foobar" exactly, and also "bar", "barf", "bark",
"barren" etc.
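
The matching rule can be sketched as follows. This is only an illustration
of the behaviour described above, written in Python rather than taken from
the actual htdig changes; the function names and the small vocabulary are
invented for the example.

```python
def split_query(query):
    """Separate plain terms from terms marked with a trailing asterisk."""
    exact, prefixes = [], []
    for term in query.split():
        if term.endswith("*") and len(term) > 1:
            prefixes.append(term[:-1])  # drop the asterisk
        else:
            exact.append(term)
    return exact, prefixes

def expand(query, vocabulary):
    """Return the indexed words a query would match, in sorted order."""
    exact, prefixes = split_query(query)
    matched = {word for word in vocabulary if word in exact}
    for prefix in prefixes:
        matched.update(word for word in vocabulary if word.startswith(prefix))
    return sorted(matched)

# Hypothetical set of indexed words.
vocabulary = ["bar", "barf", "bark", "barren", "baz", "foo", "foobar", "food"]
print(expand("foo bar* foobar", vocabulary))
```

Note that "food" is not returned, because "foo" carries no asterisk and is
therefore matched exactly.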

The modifications necessary were quite simple (and I don't even speak C++
really; Andrew's code is *really* clearly laid out), but require
replacing gdbm with a btree index such as Berkeley DB.
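
The reason an ordered index helps is presumably that a btree keeps its
keys sorted, so all words sharing a prefix sit next to each other, whereas
a hashed store such as gdbm offers no such ordering. The sketch below uses
a sorted Python list and bisect as a stand-in for a btree cursor; it does
not use the Berkeley DB API, and the key list is invented.

```python
from bisect import bisect_left

def prefix_range(sorted_keys, prefix):
    """Collect every key starting with `prefix` from a sorted key list.

    Seek to the first key >= prefix, then walk forward until a key no
    longer shares the prefix, much as one would with an ordered cursor.
    """
    matches = []
    index = bisect_left(sorted_keys, prefix)
    while index < len(sorted_keys) and sorted_keys[index].startswith(prefix):
        matches.append(sorted_keys[index])
        index += 1
    return matches

# Hypothetical word index keys, already in sorted order.
keys = ["bar", "barf", "bark", "barren", "baz", "foo", "foobar"]
print(prefix_range(keys, "bar"))
```

Only the matching keys plus one are visited, rather than every key in the
index as an unordered store would require.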

-- 
Esa Ahola
esa@cyclone.mindspring.com



