Re: htdig: Preliminary proposal for data structures to support phrases


Jason Moore (jmoore@informinteractive.com)
Mon, 8 Jun 1998 13:52:27 -0400 (EDT)


On Mon, 8 Jun 1998, Andrew Scherpbier wrote:

> Leslie Mikesell wrote:
> >
> > According to jmoore:
> > > What I propose is to store the words before and after each word in the
> > > index.
> > > -------+-----------------------------------------
> > > <Word> | <Previous Word><Next Word>
> >

> > Could you instead store a list of positions within
> > the document where each word appears? Then after looking up the
> > potential matches containing all the words, discard the ones where
> > you can't assemble a consecutive list of position numbers matching
> > the words in the phrase.
>
> Interesting ideas. My plan to support phrase searching in ht://Dig 4 would
> add a word position to each word (first word in the document is word #0,
> second one #1, etc.)
> So, the word table looks something like this:
>
> create table word
> (
> word varchar(<wordlength>), // The word in question
> docid int, // reference to the document of the word
> index int, // location of word in doc, starting at 0
> anchorid int, // references to nearest named anchor
> context byte // context of word (<h1> vs. <title>, etc)
> );
>
> The only thing I am not sure about is whether to put the anchorid field in the
> word table or create a separate table that maps documents to anchors and their
> index in the document. It would save the 4 bytes per word record and still
> provide the anchor information.
>

Ah, you want to start using a relational database as the backend. Each
word is a row in the word table. ok.

(btw - what is the purpose of anchorid? It currently exists in htdig and
I couldn't figure out what it does. Does it need to be an long int?)

> With this scheme, several types of searches become possible:
> * phrase searches
> * "near" searches
> * "before" or "after" searches
> * etc.
>
> The only disadvantage is the space complexity compared to the GDBM variable
> length record approach currently used in ht://Dig.
> This is the main reason I want to move away from a hash-based database like
> GDBM; this approach requires non-unique keys.
>

I'm just wondering how you would search for a phrase with the rdb
approach. It seems to me that you would have to run a lot of sql queries.
One to find the docs and locations of the first word. Then for each
<docid> <index/location> returned from this query, another query would
have to be run looking for matches in each document with the second word
and a location of +1.. I'm not by any means a sql god but, i'd really be
interested in seeing some (pseduo) sql that would do this.

:jason

jmoore@sober.com

----------------------------------------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig-request@sdsu.edu containing the single word "unsubscribe" in
the body of the message.



This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:26:32 PST