Re: htdig: Preliminary proposal for data structures to support phrases

Leslie Mikesell (les@Mcs.Net)
Mon, 8 Jun 1998 17:07:56 -0500 (CDT)

According to Andrew Scherpbier:
> > But worse, you now have to store copies of every unique 3 word
> > sequence in the document instead of just unique words, so frequently
> > mentioned words will expand the index even more. And you still
> > won't know if one 2 word sequence links up with another to complete
> > a 4 word phrase. Could you instead store a list of positions within
> > the document where each word appears? Then after looking up the
> > potential matches containing all the words, discard the ones where
> > you can't assemble a consecutive list of position numbers matching
> > the words in the phrase.
> Interesting ideas. My plan to support phrase searching in ht://Dig 4 would
> add a word position to each word (first word in the document is word #0,
> second one #1, etc.)
> So, the word table looks something like this:
> create table word
> (
> word varchar(<wordlength>), // The word in question
> docid int, // reference to the document of the word
> index int, // location of word in doc, starting at 0
> anchorid int, // references to nearest named anchor
> context byte // context of word (<h1> vs. <title>, etc)
> );
> The only thing I am not sure about is whether to put the anchorid field in the
> word table or create a separate table that maps documents to anchors and their
> index in the document. It would save the 4 bytes per word record and still
> provide the anchor information.

If you go this route, I'd think the anchor info would belong in a
document record with it's own location value. When you match the
word to the document you would get the nearest anchor by finding the
anchor whose location is less than the word.

> The only disadvantage is the space complexity compared to the GDBM variable
> length record approach currently used in ht://Dig.
> This is the main reason I want to move away from a hash-based database like
> GDBM; this approach requires non-unique keys.
> Thoughts/comments?

You are always going to face a trade-off in lookup speed and ease of updates
between having many normalized records and a single variable length
record with repeating fields. You might compromise by only making one
word record per document for each unique word with a list of context,index
values describing each occurrance. That still allows you to delete and
insert single documents easily. How important is the context value? Could
you include the locations of the various contexts in document records and
figure out the context of the word from this after completing the word lookup

  Les Mikesell
To unsubscribe from the htdig mailing list, send a message to containing the single word "unsubscribe" in
the body of the message.

This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:26:32 PST