Re: htdig: proposal for data structure [disk usage]

Jason Moore
Mon, 8 Jun 1998 18:39:58 -0400 (EDT)

On Mon, 8 Jun 1998 wrote:

> Whilst I would certainly like to be able to do phrase searching the
> most important element of any search engine to me has got to be speed.

I agree.

> In posts over the last few months I have noticed Andrew and several
> others suggesting a move away from GDBM (hashed type?) databases to
> btree or even SQL to improve performance.
> I see that your examples use a GDBM database and their "index will be
> at least 3 times the size of the collected documents".

I guess I had better be more careful about what I say -- the 3x figure was my
estimate for the "brute force" approach. I feel that the index should be
no larger than a 1:1 ratio with the data, and the
optimizations I outlined later in my posting would bring the index size
way down. But, while we're on the topic, does anyone have any thoughts on
a reasonable ratio of index size to content?

Ideally, "stop words", "minimum word length" and other configuration
directives could be used to decrease the disk usage of the index,
for those who are willing to trade accuracy of the index for disk space.

> Would a different type of database be able to achieve the same thing
> more efficiently? With 2.5GB of data, a 7.5GB index would be a high
> price to pay for a useful function.

From the mSQL FAQ:

  For each field in a table, mSQL will also store an additional flag byte.
  mSQL also stores an additional flag byte for each row of the table.

This is extremely low overhead -- I don't know about other databases. I'm
still trying to find some rules of thumb for GDBM's disk usage.


> Subject: htdig: Preliminary proposal for data structures to support p
> Author: jmoore <> at Internet
> Date: 6/9/98 12:11 AM
> This post is for anyone who is interested in discussing new features for
> the htdig engine.
> One of the main features that i would like to see in the new version of
> htdig is support for searching on a phrase. As you may know, currently
> only individual words are kept in the index, which makes it easy to search
> for things like "purina and dogs and chow" but not "purina dog chow".
> What I propose is to store the words before and after each word in the
> index.
> e.g.
> db.words.gdbm (GDBM) would contain a structure like:
> Key | Value
> -------+-----------------------------------------
> <Word> | <Previous Word><Next Word>
> | <Count><ID><Weight><Anchor><Location>
> | [repeats for each instance of <Word> in a document...]
> So, the phrase "object oriented database" would be retrieved by looking up
> the references to "oriented" and rejecting those that don't have the
> previous word "object" or the next word "database".
> (Note: this could be generalized to handle a phrase with any number of
> words.)
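[Editor's note: the lookup described above can be sketched as follows. This is a minimal illustration, not htdig code -- the in-memory index shape and all names are assumptions standing in for the proposed db.words.gdbm records.]

```python
# Sketch of the proposed phrase lookup: each occurrence of a word
# carries the previous and next word at that position, so a 3-word
# phrase is resolved by looking up the middle word and filtering
# on its stored neighbours. (Illustrative data, not htdig's API.)

index = {
    "oriented": [
        {"prev": "object", "next": "database", "doc_id": 1, "location": 10},
        {"prev": "goal",   "next": "thinking", "doc_id": 2, "location": 4},
    ],
}

def phrase_matches(index, phrase):
    """Return occurrences of the middle word whose stored neighbours
    match the rest of the phrase (3-word case, as in the example)."""
    first, middle, last = phrase.split()
    return [occ for occ in index.get(middle, [])
            if occ["prev"] == first and occ["next"] == last]

matches = phrase_matches(index, "object oriented database")
```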
> The main problem with this approach as outlined is that the index will be
> at least 3 times the size of the collected documents, since the previous and
> next words are stored for each word. There are probably a lot of
> optimizations that can happen here - the first is to use 2-byte short ints
> for word IDs and keep the mappings between the words and their IDs in a
> separate GDBM file. (This will also fit in with the current setup, where 5
> integers are packed and assigned to a "word" in "db.words.gdbm".)
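[Editor's note: the 2-byte word-ID optimization above might be packed like this. The field names follow the posting's WordDB layout; the packing format itself is an assumption.]

```python
# Sketch of an occurrence record with 2-byte word IDs: the previous
# and next word IDs plus the five existing integers, all packed as
# unsigned short ints -- 7 fields * 2 bytes = 14 bytes per occurrence.
import struct

# prev, next, doc_id, count, weight, anchor, location
RECORD = struct.Struct("<7H")

def pack_occurrence(prev_id, next_id, doc_id, count, weight, anchor, location):
    return RECORD.pack(prev_id, next_id, doc_id, count, weight, anchor, location)

rec = pack_occurrence(12, 34, 1, 2, 100, 0, 10)  # 14-byte record
```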
> Another optimization would be for all document collections to share the
> same Word->ID index. (There would be some issues around ensuring that
> this index could be recreated, since if it were corrupted, all indexes
> would have to be rebuilt - a big hassle.)
> Another optimization would be to drop certain words from the index.
> Presumably, if stop words are dropped from both the user's search term and
> the indexes, the slight strangeness of the results would still be intuitive
> to the end user. e.g. A search for "Sid and Nancy" would be converted to
> "sid nancy" and would return matches in the index for "Sid Nancy", "Sid
> and Nancy", or "Sid of Nancy" (assuming "and" and "of" were stop words).
> I haven't fully worked out the details of this one yet.
> I've been prototyping the data model with Perl, so hopefully I can soon have
> some more specific factual data as to the size of the indexes. I think
> disk space - balancing the accuracy of the index against its size -
> will probably be the key issue. I'm including more details of the
> data structures below. Please post any comments or concerns to the list.
> :jason
> The following 3 data structures are GDBM files.
> Name: DocumentDB
> Key | Value
> ---------------+---------------------------------
> <Document ID> | <What> [char]
> | <Time> [integer]
> | <Accessed> ["]
> | <State> ["]
> | <Size> ["]
> | <Links> ["]
> | <ImageSize> ["]
> | <HopCount> ["]
> | <URL> [variable size]
> | <Head> ["]
> | <Title> ["]
> | <Description>["]
> | <Email> ["]
> | <Notification> ["]
> | <Subject> ["]
> Name: WordDB
> Key | Value
> ----------+--------------------------------------
> <Word ID> | <Previous Word ID> [unsigned short int]
> | <Next Word ID> ["]
> | <Document ID> ["]
> | <Count> ["]
> | <Weight> ["]
> | <Anchor> ["]
> | <Location> ["]
> | [ entire structure repeats for each occurrence
> | of the word in the documents... 14 bytes/occurrence]
> Name: Dictionary
> Key | Value
> ----------+--------------------------------------
> <Word> | <Word ID> [unsigned short int] [unique]
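[Editor's note: the Dictionary structure above -- a word-to-16-bit-ID mapping kept separately so WordDB can refer to words by 2-byte IDs -- can be sketched like this. Plain dicts stand in for the GDBM files; the ID-assignment policy is an assumption.]

```python
# Sketch of the Dictionary file: assign each distinct word a unique
# unsigned short ID on first sight. Using a 2-byte ID caps the
# vocabulary at 65,536 distinct words -- a real limit to weigh.
dictionary = {}  # word -> id

def word_id(word):
    if word not in dictionary:
        if len(dictionary) >= 0xFFFF:
            raise OverflowError("word-ID space exhausted")
        dictionary[word] = len(dictionary)
    return dictionary[word]

ids = [word_id(w) for w in "object oriented database object".split()]
```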
> ----------------------------------------------------------------------
> To unsubscribe from the htdig mailing list, send a message to
> containing the single word "unsubscribe" in
> the body of the message.

jason moore 416.595.5232 x.225
Internet Applications Developer Inform Interactive


This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:26:32 PST