Re: htdig: Preliminary proposal for data structures to support phrases


Leslie Mikesell (les@Mcs.Net)
Mon, 8 Jun 1998 14:21:29 -0500 (CDT)


According to Jason Moore:
>
> > And you still
> > won't know if one 2 word sequence links up with another to complete
> > a 4 word phrase.
>
> Sure I do. e.g. "One two three four" would start the search at the second
> term and end at the second last - so there would be two searches:
>
> 1. "two" where prev is "one" and next is "three"
> 2. "three" where prev is "two" and next is "four"

That means you have to have every three-word sequence indexed, not
just every unique sequence, and you have to wade through all the
possibilities the hard way.

> > Could you instead store a list of positions within
> > the document where each word appears? Then after looking up the
> > potential matches containing all the words, discard the ones where
> > you can't assemble a consecutive list of position numbers matching
> > the words in the phrase.
>
> You could, but I think you might have some lookup efficiency problems. If
> i had a gdbm file of words for keys and kept the absolute "location" (some
> sort of number) then the only way to find the next word would be to
> linearly search through all the results looking for the next position..
> If you had an additional index of postions to words, then you might have
> something... Have you tried this "location" approach before?

You would look up the words first. There would be only one instance
of the word indexed per document so it can be a gdbm key. Part of the value
data would be the list of of locations. To find a phrase you would
find the first word, then the second, then check to see if you can
match a sequence where the location of the second word follows the
first one, then repeat for the rest of the words in the phrase discarding
the possible paths as soon as you see they can't work. I haven't
tried this but I would expect the computation of possible paths
to happen much faster than additional disk lookups. Plus, it allows
'near' and 'not near' searchs with a variation in the math.

  Les Mikesell
   les@mcs.com
----------------------------------------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig-request@sdsu.edu containing the single word "unsubscribe" in
the body of the message.



This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:26:32 PST