Re: htdig: Preliminary proposal for data structures to support phrases


Andrew Scherpbier (andrew@contigo.com)
Mon, 08 Jun 1998 16:11:18 +0000


Leslie Mikesell wrote:
>
> According to jmoore:
> > What I propose is to store the words before and after each word in the
> > index.
> > -------+-----------------------------------------
> > <Word> | <Previous Word><Next Word>
>
> > The main problem with this approach as outlined, is that the index will be
> > at least 3 times the size of the collected documents since the previous
> > and next word is stored for each word.
>
> But worse, you now have to store copies of every unique 3 word
> sequence in the document instead of just unique words, so frequently
> mentioned words will expand the index even more. And you still
> won't know if one 2 word sequence links up with another to complete
> a 4 word phrase. Could you instead store a list of positions within
> the document where each word appears? Then after looking up the
> potential matches containing all the words, discard the ones where
> you can't assemble a consecutive list of position numbers matching
> the words in the phrase.

Interesting ideas. My plan to support phrase searching in ht://Dig 4
is to add a word position to each word (the first word in the document
is word #0, the second is #1, etc.). The word table would then look
something like this:

create table word
(
    word varchar(<wordlength>),  -- The word in question
    docid int,                   -- Reference to the document containing the word
    index int,                   -- Location of the word in the doc, starting at 0
    anchorid int,                -- Reference to the nearest named anchor
    context byte                 -- Context of the word (<h1> vs. <title>, etc.)
);
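
With the position stored for every word, a phrase search turns into a
self-join on consecutive index values. Something like the following
(a rough, untested sketch just to illustrate; the phrase "free text
search" is only an example, and "index" is a keyword in most SQL
dialects, so it would probably need quoting or a different column name
in practice):

    select distinct w1.docid
    from   word w1, word w2, word w3
    where  w1.word = 'free'
      and  w2.word = 'text'
      and  w3.word = 'search'
      and  w2.docid = w1.docid
      and  w3.docid = w1.docid
      and  w2.index = w1.index + 1
      and  w3.index = w1.index + 2;

Each extra word in the phrase adds another join, so very long phrases
get expensive, but the matches are exact.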

The only thing I am not sure about is whether to put the anchorid field
in the word table or to create a separate table that maps documents to
anchors and their index in the document. The separate table would save
4 bytes per word record and still provide the anchor information.
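
The separate table might look something like this (again only a sketch,
with made-up names):

    create table anchor
    (
        docid int,     -- Document the anchor appears in
        anchorid int,  -- Identifier of the named anchor
        index int      -- Position of the anchor in the document
    );

To report the nearest anchor for a match, you would then take the
anchor row with the same docid and the largest index that is still less
than or equal to the matching word's index.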

With this scheme, several types of searches become possible:
* phrase searches
* "near" searches
* "before" or "after" searches
* etc.
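
For instance, a "near" search (two words within, say, 10 positions of
each other; the words and the distance here are just examples) might be
expressed as:

    select distinct w1.docid
    from   word w1, word w2
    where  w1.word = 'htdig'
      and  w2.word = 'phrase'
      and  w2.docid = w1.docid
      and  abs(w1.index - w2.index) <= 10;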

The only disadvantage is the extra space this takes compared to the
variable length GDBM records currently used in ht://Dig. The need for
non-unique keys (one record per word occurrence) is also the main
reason I want to move away from a hash-based database like GDBM.

Thoughts/comments?

-- 
Andrew Scherpbier <andrew@contigo.com>
Contigo Software <http://www.contigo.com/>


