The htword Database Format

by Loic Dachary Copyright © 2000 Loic Dachary

Some comments on the htword database format and on the structure of the word database (inverted index).

The structure of the inverted index makes it possible to open a search cursor for every word in the query. Searching for the first occurrences of each search term in parallel is therefore supported. The inverted index can also maintain the frequency of terms. This is not done by default, but setting 'wordlist_extend: true' activates it.
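
To make the idea of one cursor per query word concrete, here is a minimal sketch in Python. It is not the htword code: the sorted in-memory list standing in for the on-disk index, the (word, document id, position) key layout, and the names WordCursor and first_occurrences are all assumptions made for illustration.

```python
import bisect

# Hypothetical stand-in for the on-disk index: a sorted list of
# (word, document id, position) keys. The real key layout may differ.
INDEX = sorted([
    ("database", 1, 4),
    ("database", 3, 9),
    ("index", 1, 2),
    ("index", 2, 7),
    ("word", 2, 1),
])


class WordCursor:
    """Walks the occurrences of a single word in key order."""

    def __init__(self, index, word):
        self.index = index
        self.word = word
        # (word,) sorts before every (word, doc, pos) key, so this lands
        # on the first occurrence of the word, if there is one.
        self.pos = bisect.bisect_left(index, (word,))

    def next(self):
        """Return the next (word, doc, pos) key, or None when exhausted."""
        if self.pos < len(self.index) and self.index[self.pos][0] == self.word:
            entry = self.index[self.pos]
            self.pos += 1
            return entry
        return None


def first_occurrences(index, words, limit):
    """Advance one cursor per word in lockstep, up to `limit` steps each."""
    cursors = {w: WordCursor(index, w) for w in words}
    results = {w: [] for w in words}
    for _ in range(limit):
        for w, cursor in cursors.items():
            entry = cursor.next()
            if entry is not None:
                results[w].append(entry[1:])  # keep (doc, pos)
    return results
```

Calling first_occurrences(INDEX, ["index", "word"], 2) steps both cursors side by side and never touches the entries for "database".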

The inverted index is also able to store word occurrences in order of relevance ranking (provided the relevance ranking of each word can be calculated at indexing time). This way the first 10 occurrences of a word are always the most relevant.
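
One way to picture this is to put the ranking into the key itself, so that the sort order of the index is the relevance order. The sketch below assumes a (word, negated score, document id) key and a numeric score computed at indexing time. Both are assumptions for illustration, as are the names make_key and top_occurrences, and none of it is taken from the htword sources.

```python
import bisect


def make_key(word, score, doc_id):
    # Negate the score so that, within one word, higher scores sort first.
    return (word, -score, doc_id)


def top_occurrences(index, word, n):
    """Read at most n entries for `word`, starting at its first key."""
    start = bisect.bisect_left(index, (word,))
    hits = []
    for w, neg_score, doc_id in index[start:start + n]:
        if w != word:
            break  # ran past the last occurrence of this word
        hits.append((doc_id, -neg_score))
    return hits


# Hypothetical (word, score, document id) data, scored at indexing time.
RANKED_INDEX = sorted(
    make_key(word, score, doc_id)
    for word, score, doc_id in [
        ("index", 3, 10),
        ("index", 9, 11),
        ("index", 5, 12),
        ("word", 7, 10),
    ]
)
```

With this layout, top_occurrences(RANKED_INDEX, "index", 2) reads exactly two entries and they are the two best-scored ones. No sorting happens at search time.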

Obviously, some relevance ranking algorithms need to work on all the occurrences of the words or on all the documents found, and in that case you have to retrieve all of them (word occurrences or documents). But for simple queries with relevance ranking encoded in the inverted index, the number of word occurrences that need to be retrieved for each search can be close to optimal.

I studied the search mechanism of htdig and figured out that changing it to take advantage of the index structure is not a trivial task. I chose to focus on the index structure first and have a reliable piece of code before diving into this. The last fix committed shows that this part is quite tricky ;-)


Last modified: $Date: 2001/01/22 01:21:58 $