htdig: Preliminary proposal for data structures to support phrases


jmoore (jmoore@sober.com)
Tue, 9 Jun 1998 00:11:42 -0400 (EDT)


This post is for anyone who is interesting in discussing new features for
the htdig engine.

One of the main features that i would like to see in the new version of
htdig is support for searching on a phrase. As you may know, currently
only individual words are kept in the index, which makes it easy to search
for things like "purina and dogs and chow" but not "purina dog chow".
What I propose is to store the words before and after each word in the
index.

e.g.

db.words.gdbm (GDBM) would contain a structure like:

  Key | Value
  -------+-----------------------------------------
  <Word> | <Previous Word><Next Word>
         | <Count><ID><Weight><Anchor><Location>
         | [repeats for each instance of <Word> in a document...]

So, the phrase "object oriented database" would be retrieved by looking up
the references to "oriented" and rejecting those that don't have the
previous word "object" or the next word "database".

(Note: this could be generalized to handle a phrase with any number of
words.)

The main problem with this approach as outlined, is that the index will be
at least 3 times the size of the collected documents since the previous
and next word is stored for each word. There are probably a lot of
optimizations that can happen here - the first is to use 2 byte short ints
for word id's and keep the mappings between the words and their IDs in a
separate gdbm file. (This will also fit in with the current setup where 5
integers are packed and assigned to a "word" in "db.words.gdbm").

Another optimization would be for all document collections to share the
same Word->ID index. (There would be some issues around to ensure that
this index could be recreated, since if it was corrupted all indexes
would have to be rebuilt - big hassle.)

Another optimization would be to drop certain words from the index.
Presumably, if stop words are dropped from the users search term and the
indexes, the slight strangeness of the results would still be intuitive
to the end user. e.g. A search for "Sid and Nancy" would be converted to
"sid nancy" and would return matches in the index to "Sid Nancy" or "Sid
and Nancy" or "Sid of Nancy" (assuming "and" and "of" were stop words).
I haven't fully worked out the details of this one yet.

I've been prototyping the data model with perl so, hopefully I can have
some more specific factual data as to the size of the indexes. I think
disk space - and balancing between the accuracy of the index and its size
will probably be the key issue. I'm including more details of the
data structures below. Please post any comments or concerns to the list.

:jason

jmoore@sober.com

The following 3 data structures are GDBM files.

  Name: DocumentDB
  Key | Value
  ---------------+---------------------------------
  <Document ID> | <What> [char]
                 | <Time> [integer]
                 | <Accessed> ["]
                 | <State> ["]
                 | <Size> ["]
                 | <Links> ["]
                 | <ImageSize> ["]
                 | <HopCount> ["]
                 | <URL> [variable size]
                 | <Head> ["]
                 | <Title> ["]
                 | <Description>["]
                 | <Email> ["]
                 | <Notification> ["]
                 | <Subject> ["]

  Name: WordDB
  Key | Value
  ----------+--------------------------------------
  <Word ID> | <Previous Word ID> [unsigned short int]
            | <Next Word ID> ["]
            | <Document ID> ["]
            | <Count> ["]
            | <Weight> ["]
            | <Anchor> ["]
            | <Location> ["]
            | [ entire structure repeats for each occurance
            | of the word in the documents... 14 bytes/occurance]

  Name: Dictionary
  Key | Value
  ----------+--------------------------------------
  <Word> | <Word ID> [unsigned short int] [unique]
  

----------------------------------------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig-request@sdsu.edu containing the single word "unsubscribe" in
the body of the message.



This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:26:32 PST