Re: AW: [htdig] Prefix search


Subject: Re: AW: [htdig] Prefix search
From: Geoff Hutchison (ghutchis@wso.williams.edu)
Date: Wed Aug 23 2000 - 06:06:00 PDT


First off, see <http://www.htdig.org/FAQ.html#q1.16>

At 9:38 AM +0200 8/23/00, Reich, Stefan wrote:
>maybe I was not specific enough.
>
>Range search means:
>
>I want to search all documents which contain e.g. all words between
>2000-02-01 and 2000-05-01 (as I said, the "searchwords" in my database are
>mostly dates!)
>
>My first idea was to construct a search list like:
>
>(2000-02* or 2000-03* or 2000-04* or 2000-05*)
>
>I thought this is ok for a short range, but not for, let's say, 1998-01 to
>2000-08.
>
>But now I've seen that a search for 2000* comes up with a list in
>$LOGICAL_WORDS containing some hundreds of dates.
>
>I don't know if this large list is built first and then used by htsearch or
>if this list is built up during the search and only presented afterwards.
>
>If the list is built beforehand, I can do something which I normally would
>not have done because of performance issues:
>
>I can use db.wordlist to build my search list as described above. I will not
>include all possible dates but only those contained in db.wordlist. This
>will make the list shorter. But it still may contain hundreds of words.

If you just want to do dates, then this will be fine. But since this
is a type of substring search, you will need to be careful making an
additional database since it could become exponentially large. (The
more substrings you want to match, the larger the database.)
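
As a rough illustration of the db.wordlist approach described in the
question, here is a minimal Python sketch. It rests on several assumptions
that are not confirmed in this thread: that each line of the word list
starts with the word itself followed by whitespace, that the dates are
stored as YYYY-MM-DD, and that the resulting "or" list is pasted into a
boolean search. The function names and the sample lines (including their
trailing fields) are made up for the example.

```python
import re

# Assumption: dates are indexed as words of the form YYYY-MM-DD.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def dates_in_range(wordlist_lines, start, end):
    """Return the sorted, unique date words with start <= word <= end."""
    found = set()
    for line in wordlist_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word = fields[0]  # assumption: the word is the first field on a line
        # YYYY-MM-DD strings sort in date order, so plain string
        # comparison is enough for the range check.
        if DATE_RE.match(word) and start <= word <= end:
            found.add(word)
    return sorted(found)


def build_or_query(words):
    """Join the words into a boolean 'or' expression."""
    return " or ".join(words)


# Made-up sample lines; the fields after the word are placeholders only.
sample = [
    "2000-01-15 i:1 l:10 w:1",
    "2000-02-01 i:2 l:5 w:1",
    "2000-03-17 i:2 l:9 w:1",
    "2000-03-17 i:4 l:2 w:1",
    "2000-05-02 i:3 l:1 w:1",
    "apple i:3 l:7 w:1",
]

query = build_or_query(dates_in_range(sample, "2000-02-01", "2000-05-01"))
print(query)  # 2000-02-01 or 2000-03-17
```

Only dates that actually occur in the word list end up in the query, so
the list stays as short as the data allows.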

>So to cut the question short:
>
>1. Is there a limit to the number of words a search can contain? (not only
>technical, but also from pov of performance?)

It would be hard to estimate, but a query of more than about 100 words
would probably not perform well.

The only "technical limit" on htsearch is a timeout after 5 minutes.
Even that you could change in the code.

>2. If a database contains (ant ape antilope bear ...) what search is faster
>"ant or antilope or ape or bear or ..." or "a* or b* or c* ..."
>
>I hope this is not too confusing ;-) I know we are using htdig far beyond the
>scope it is designed for, but I am amazed how much is possible!!!

The former is faster--any time you have to do a fuzzy search, it must
first generate the list of alternatives and THEN do the search. (Also
remember that you can't "chain" fuzzy methods, though this has been
suggested.)
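
The two-step behaviour can be pictured with a toy model in Python. This is
not htsearch's code, only a sketch under stated assumptions: the index is
a plain in-memory dict from word to document ids, "*" is treated as a
simple trailing wildcard, and the names expand_prefix and search_or are
invented for the example. The words are the ones from the question.

```python
def expand_prefix(term, vocabulary):
    """Expand 'a*' into every known word starting with 'a'.

    Terms without a trailing '*' are returned unchanged.
    """
    if term.endswith("*"):
        prefix = term[:-1]
        return sorted(w for w in vocabulary if w.startswith(prefix))
    return [term]


def search_or(terms, index):
    """OR-search a toy index; returns (expanded word list, matching doc ids)."""
    # Step 1: generate the full list of alternatives for every term.
    expanded = []
    for term in terms:
        expanded.extend(expand_prefix(term, index.keys()))
    # Step 2: only now run the actual lookups, one per expanded word.
    hits = set()
    for word in expanded:
        hits |= index.get(word, set())
    return expanded, hits


# Toy index: word -> set of document ids (made-up data).
index = {"ant": {1}, "antilope": {2}, "ape": {1, 3}, "bear": {4}}

explicit_words, explicit_hits = search_or(["ant", "antilope", "ape", "bear"], index)
prefix_words, prefix_hits = search_or(["a*", "b*"], index)

print(prefix_words)                   # ['ant', 'antilope', 'ape', 'bear']
print(explicit_hits == prefix_hits)   # True
```

Both queries end up looking up the same four words; the prefix form simply
pays for an extra pass over the vocabulary before the search starts.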

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/



