Re: htdig: Preliminary proposal for data structures to support phrases


Leslie Mikesell (les@Mcs.Net)
Mon, 8 Jun 1998 00:57:25 -0500 (CDT)


According to jmoore:
> What I propose is to store the words before and after each word in the
> index.
> -------+-----------------------------------------
> <Word> | <Previous Word><Next Word>

> The main problem with this approach, as outlined, is that the index will
> be at least 3 times the size of the collected documents, since the
> previous and next words are stored for each word.

But worse, you now have to store an entry for every unique 3-word
sequence in the document instead of just for each unique word, so
frequently used words will expand the index even more. And you still
won't know whether one 2-word sequence links up with another to
complete a 4-word phrase. Could you instead store a list of the
positions within the document where each word appears? Then, after
looking up the potential matches containing all the words, discard
the ones where you can't assemble a consecutive run of position
numbers matching the words in the phrase.
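
Roughly what I have in mind, as a minimal Python sketch rather than
anything resembling actual htdig code. The names build_index and
phrase_search, the word -> {document: [positions]} layout, and the
whitespace tokenizing are all assumptions made up for illustration:

```python
from collections import defaultdict


def build_index(docs):
    """Build word -> {doc_id: [positions]} (a hypothetical layout)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenizer: lowercase and split on whitespace.
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted ids of documents containing the words consecutively."""
    words = phrase.lower().split()
    if not words:
        return []

    # Step 1: potential matches are documents containing all the words.
    candidates = None
    for word in words:
        ids = set(index.get(word, {}))
        candidates = ids if candidates is None else candidates & ids

    # Step 2: discard candidates where the positions are not consecutive.
    hits = []
    for doc_id in sorted(candidates):
        # Every position of the first word is a possible phrase start.
        starts = set(index[words[0]][doc_id])
        for offset, word in enumerate(words[1:], start=1):
            positions = set(index[word][doc_id])
            starts = {s for s in starts if s + offset in positions}
        if starts:
            hits.append(doc_id)
    return hits


docs = {
    "a": "the quick brown fox jumps",
    "b": "the brown quick fox sleeps",
    "c": "quick brown dogs chase the brown fox",
}
index = build_index(docs)
print(phrase_search(index, "quick brown fox"))
```

All three sample documents contain "quick", "brown" and "fox", but
only "a" has them at consecutive positions, so it is the only one to
survive the second step. Each word occurrence costs one position
number, and the same lists work for a phrase of any length.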

  Les Mikesell
    les@mcs.com


