Re: htdig: Preliminary proposal for data structures to support phrases


Jason Moore (jmoore@informinteractive.com)
Mon, 8 Jun 1998 11:17:16 -0400 (EDT)


On Mon, 8 Jun 1998, Leslie Mikesell wrote:

> According to jmoore:
> > What I propose is to store the words before and after each word in the
> > index.
> > -------+-----------------------------------------
> > <Word> | <Previous Word><Next Word>
>
> > The main problem with this approach as outlined, is that the index will be
> > at least 3 times the size of the collected documents since the previous
> > and next word is stored for each word.
>
> But worse, you now have to store copies of every unique 3 word
> sequence in the document instead of just unique words, so frequently
> mentioned words will expand the index even more.

Yes each instance of a common word (e.g. "is") will expand the index.
that's why I outlined the "dictionary" idea where each word is kept in a
dictionary, and only ID's for the word are used. This means that each
word needs 14 bytes. (btw - currently, 5 integers are used which uses 20
bytes per word). If this is perceived as too expensive, then
assumptions/optimizations can be made (which i started to outline in the
last posting).

> And you still
> won't know if one 2 word sequence links up with another to complete
> a 4 word phrase.

Sure I do. e.g. "One two three four" would start the search at the second
term and end at the second last - so there would be two searches:

1. "two" where prev is "one" and next is "three"
2. "three" where prev is "two" and next is "four"

(The second step would just limit the results returned from the first.)
Because every word in the index is linked, you can build phrases of an
arbitrary length.

> Could you instead store a list of positions within
> the document where each word appears? Then after looking up the
> potential matches containing all the words, discard the ones where
> you can't assemble a consecutive list of position numbers matching
> the words in the phrase.

You could, but I think you might have some lookup efficiency problems. If
i had a gdbm file of words for keys and kept the absolute "location" (some
sort of number) then the only way to find the next word would be to
linearly search through all the results looking for the next position..
If you had an additional index of postions to words, then you might have
something... Have you tried this "location" approach before?

:jason

jmoore@sober.com

----------------------------------------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig-request@sdsu.edu containing the single word "unsubscribe" in
the body of the message.



This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:26:31 PST