[htdig3-dev] Re: substring (was Multiple database (patch))

Subject: [htdig3-dev] Re: substring (was Multiple database (patch))
From: Geoff Hutchison (ghutchis@wso.williams.edu)
Date: Thu Feb 10 2000 - 16:23:11 PST

At 12:26 PM +0100 2/10/00, loic@ceic.com wrote:
>by the code. But if it's activated, a list of unique words is maintained
>in the index. I use this a lot in a context other than htdig so I'm really
>sure it works well. But it takes a bit more space, of course.
> The 'substring' search could browse this list instead of the complete index
>and that would give a list of candidates much more quickly.

This would speed things up for now, but the algorithm I'm thinking of
would speed things up considerably more. Basically, you make a list
of all the trigrams (or n-grams) in your query, then you use a
pre-computed trigram database (like the metaphone and soundex ones)
to narrow the search space down to only the words that contain all
the trigrams in your query. There will be some false matches (a word
can contain every trigram without containing them in sequence), so
you'll still need to run the substring check on these candidates, but
you'll be doing it over a small subset of the word database.


query: metaphone -> met eta tap aph pho hon one

There are obviously few words with all of these trigrams so you're set.
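
As a rough illustration of the idea (not ht://Dig code), here is a
minimal Python sketch. It assumes the word list fits in memory and
uses a plain dictionary in place of the pre-computed trigram database.
The names trigrams, build_index and substring_search are made up for
this example.

```python
from collections import defaultdict


def trigrams(word, n=3):
    """Return the set of n-grams (trigrams by default) in `word`."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def build_index(words, n=3):
    """Map each n-gram to the set of words containing it.

    Assumption: stands in for the pre-computed trigram database.
    """
    index = defaultdict(set)
    for word in words:
        for gram in trigrams(word, n):
            index[gram].add(word)
    return index


def substring_search(query, index, words, n=3):
    """Return the words containing `query` as a substring, sorted."""
    grams = trigrams(query, n)
    if not grams:
        # Query is shorter than n, so the index cannot help: scan everything.
        candidates = set(words)
    else:
        # Intersect the word sets, smallest first, to shrink candidates fast.
        postings = sorted((index.get(g, set()) for g in grams), key=len)
        candidates = set(postings[0])
        for posting in postings[1:]:
            candidates &= posting
            if not candidates:
                break
    # Candidates may hold every trigram without holding the substring,
    # so the real substring check is still needed on this small set.
    return sorted(w for w in candidates if query in w)


if __name__ == "__main__":
    words = ["metaphone", "metaphones", "telephone", "metal", "babxaba"]
    index = build_index(words)
    print(substring_search("metaphone", index, words))
    print(substring_search("abab", index, words))
```

The second query shows why the final check matters: "babxaba" contains
both "aba" and "bab", so it survives the trigram filter, but it does not
contain "abab" and is dropped by the substring test.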


