Re: htdig: Preliminary proposal for data structures to support phrases

Jason Moore (
Mon, 8 Jun 1998 13:52:27 -0400 (EDT)

On Mon, 8 Jun 1998, Andrew Scherpbier wrote:

> Leslie Mikesell wrote:
> >
> > According to jmoore:
> > > What I propose is to store the words before and after each word in the
> > > index.
> > > -------+-----------------------------------------
> > > <Word> | <Previous Word><Next Word>
> >

> > Could you instead store a list of positions within
> > the document where each word appears? Then after looking up the
> > potential matches containing all the words, discard the ones where
> > you can't assemble a consecutive list of position numbers matching
> > the words in the phrase.
> Interesting ideas. My plan to support phrase searching in ht://Dig 4 would
> add a word position to each word (first word in the document is word #0,
> second one #1, etc.)
> So, the word table looks something like this:
> create table word
> (
> word varchar(<wordlength>), // The word in question
> docid int, // reference to the document of the word
> index int, // location of word in doc, starting at 0
> anchorid int, // references to nearest named anchor
> context byte // context of word (<h1> vs. <title>, etc)
> );
> The only thing I am not sure about is whether to put the anchorid field in the
> word table or create a separate table that maps documents to anchors and their
> index in the document. It would save the 4 bytes per word record and still
> provide the anchor information.

Ah, you want to start using a relational database as the backend. Each
word is a row in the word table. ok.

(btw - what is the purpose of anchorid? It currently exists in htdig and
I couldn't figure out what it does. Does it need to be an long int?)

> With this scheme, several types of searches become possible:
> * phrase searches
> * "near" searches
> * "before" or "after" searches
> * etc.
> The only disadvantage is the space complexity compared to the GDBM variable
> length record approach currently used in ht://Dig.
> This is the main reason I want to move away from a hash-based database like
> GDBM; this approach requires non-unique keys.

I'm just wondering how you would search for a phrase with the rdb
approach. It seems to me that you would have to run a lot of sql queries.
One to find the docs and locations of the first word. Then for each
<docid> <index/location> returned from this query, another query would
have to be run looking for matches in each document with the second word
and a location of +1.. I'm not by any means a sql god but, i'd really be
interested in seeing some (pseduo) sql that would do this.


To unsubscribe from the htdig mailing list, send a message to containing the single word "unsubscribe" in
the body of the message.

This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:26:32 PST