Re: htdig: Preliminary proposal for data structures to support phrases

Jason Moore (
Mon, 8 Jun 1998 16:33:55 -0400 (EDT)

On Mon, 8 Jun 1998, Leslie Mikesell wrote:

> According to Jason Moore:
> >
> > > And you still
> > > won't know if one 2 word sequence links up with another to complete
> > > a 4 word phrase.
> >
> > Sure I do. e.g. "One two three four" would start the search at the second
> > term and end at the second last - so there would be two searches:
> >
> > 1. "two" where prev is "one" and next is "three"
> > 2. "three" where prev is "two" and next is "four"
> That means you have to have every three-word sequence indexed, not
> just every unique sequence, and you have to wade through all the
> possibilities the hard way.

hmm, that's true. A quick grep on a 14 meg mailing list archive reveals
37624 lines containing "is" and 32608 lines containing "the". That would
be a lot of "hits" to linearly search through. (I guess stop words would
be essential using this approach...)

> > > Could you instead store a list of positions within
> > > the document where each word appears? Then after looking up the
> > > potential matches containing all the words, discard the ones where
> > > you can't assemble a consecutive list of position numbers matching
> > > the words in the phrase.
> >
> > You could, but I think you might have some lookup efficiency problems. If
> > i had a gdbm file of words for keys and kept the absolute "location" (some
> > sort of number) then the only way to find the next word would be to
> > linearly search through all the results looking for the next position..
> > If you had an additional index of postions to words, then you might have
> > something... Have you tried this "location" approach before?
> You would look up the words first. There would be only one instance
> of the word indexed per document so it can be a gdbm key. Part of the value
> data would be the list of of locations.

ok, so the data structure would be something like the following:

[Word DB]
<Word> : <Word ID>

[Url DB]
<Url ID> : <Url>

[Doc DB]
<Word ID> : <Url ID>[<Url ID>]..

[Location DB]
<Word ID><Url ID> : <Location>[<Location>...]

> To find a phrase you would
> find the first word, then the second, then check to see if you can
> match a sequence where the location of the second word follows the
> first one, then repeat for the rest of the words in the phrase discarding
> the possible paths as soon as you see they can't work.

yeah, that's cool. My only concern is that each unique word in the
documents will use a key (albeit a small key and smallish value). I'll
try this with perl tonight to see how fast the lookup is and how much disk
space the index uses.

> I haven't
> tried this but I would expect the computation of possible paths
> to happen much faster than additional disk lookups. Plus, it allows
> 'near' and 'not near' searchs with a variation in the math.

The indexing thing sounds like the way to go... I'll let you know how my
tests go.


To unsubscribe from the htdig mailing list, send a message to containing the single word "unsubscribe" in
the body of the message.

This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:26:32 PST