Re: htdig: Preliminary proposal for data structures to suppo
Mon, 8 Jun 1998 12:09:31 -0700

     Whilst I would certainly like to be able to do phrase searching the
     most important element of any search engine to me has got to be speed.
     In posts over the last few months I have noticed Andrew and several
     others suggesting a move away from GDBM (hashed type?) databases to
     btree or even SQL to improve performance.
     I see that your examples use a GDBM database and their "index will be
     at least 3 times the size of the collected documents".
     Would a different type of database be able to achieve the same thing
     more efficiently? With 2.5GB of data, a 7.5GB index would be a high
     price to pay for a useful function.
     Paul Lucas
     Frost & Sullivan

______________________________ Reply Separator _________________________________
Subject: htdig: Preliminary proposal for data structures to support p
Author: jmoore <> at Internet
Date: 6/9/98 12:11 AM

This post is for anyone who is interesting in discussing new features for
the htdig engine.
One of the main features that i would like to see in the new version of
htdig is support for searching on a phrase. As you may know, currently
only individual words are kept in the index, which makes it easy to search
for things like "purina and dogs and chow" but not "purina dog chow".
What I propose is to store the words before and after each word in the
db.words.gdbm (GDBM) would contain a structure like:
  Key | Value
  <Word> | <Previous Word><Next Word>
         | <Count><ID><Weight><Anchor><Location>
         | [repeats for each instance of <Word> in a document...]
So, the phrase "object oriented database" would be retrieved by looking up
the references to "oriented" and rejecting those that don't have the
previous word "object" or the next word "database".
(Note: this could be generalized to handle a phrase with any number of
The main problem with this approach as outlined, is that the index will be
at least 3 times the size of the collected documents since the previous and
next word is stored for each word. There are probably a lot of
optimizations that can happen here - the first is to use 2 byte short ints
for word id's and keep the mappings between the words and their IDs in a
separate gdbm file. (This will also fit in with the current setup where 5
integers are packed and assigned to a "word" in "db.words.gdbm").
Another optimization would be for all document collections to share the
same Word->ID index. (There would be some issues around to ensure that
this index could be recreated, since if it was corrupted all indexes
would have to be rebuilt - big hassle.)
Another optimization would be to drop certain words from the index.
Presumably, if stop words are dropped from the users search term and the
indexes, the slight strangeness of the results would still be intuitive
to the end user. e.g. A search for "Sid and Nancy" would be converted to
"sid nancy" and would return matches in the index to "Sid Nancy" or "Sid
and Nancy" or "Sid of Nancy" (assuming "and" and "of" were stop words).
I haven't fully worked out the details of this one yet.
I've been prototyping the data model with perl so, hopefully I can have
some more specific factual data as to the size of the indexes. I think
disk space - and balancing between the accuracy of the index and its size
will probably be the key issue. I'm including more details of the
data structures below. Please post any comments or concerns to the list.
The following 3 data structures are GDBM files.
  Name: DocumentDB
  Key | Value
  <Document ID> | <What> [char]
                 | <Time> [integer]
                 | <Accessed> ["]
                 | <State> ["]
                 | <Size> ["]
                 | <Links> ["]
                 | <ImageSize> ["]
                 | <HopCount> ["]
                 | <URL> [variable size]
                 | <Head> ["]
                 | <Title> ["]
                 | <Description>["]
                 | <Email> ["]
                 | <Notification> ["]
                 | <Subject> ["]
  Name: WordDB
  Key | Value
  <Word ID> | <Previous Word ID> [unsigned short int]
            | <Next Word ID> ["]
            | <Document ID> ["]
            | <Count> ["]
            | <Weight> ["]
            | <Anchor> ["]
            | <Location> ["]
            | [ entire structure repeats for each occurance
            | of the word in the documents... 14 bytes/occurance]
  Name: Dictionary
  Key | Value
  <Word> | <Word ID> [unsigned short int] [unique]
To unsubscribe from the htdig mailing list, send a message to containing the single word "unsubscribe" in
the body of the message.

Received: from ([]) by with
  (IMA Internet Exchange 3.01 Enterprise) id 0000FA4B; Mon, 8 Jun 98 11:29:42
Received: from ( by (Worldmail
1.3.167) for; 7 Jun 1998 23:07:13 -0500
Received: (from daemon@localhost)
        by (8.8.7/8.8.7) id WAA16003
        for htdig-out; Sun, 7 Jun 1998 22:17:33 -0700 (PDT)
Received: from sartre ( [])
        by (8.8.7/8.8.7) with ESMTP id WAA15998
        for <>; Sun, 7 Jun 1998 22:17:27 -0700 (PDT)
Received: from localhost (jmoore@localhost) by sartre (8.7.6/8.7.3) with SMTP id
AAA26797 for <>; Tue, 9 Jun 1998 00:11:43 -0400
X-Authentication-Warning: sartre: jmoore owned process doing -bs
Date: Tue, 9 Jun 1998 00:11:42 -0400 (EDT)
From: jmoore <>
X-Sender: jmoore@sartre
Subject: htdig: Preliminary proposal for data structures to support phrases
Message-ID: <Pine.LNX.3.95.980609000327.26504A-100000@sartre>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Precedence: bulk
Reply-To: jmoore <>
To unsubscribe from the htdig mailing list, send a message to containing the single word "unsubscribe" in
the body of the message.

This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:26:32 PST