Re: htdig: Stars...

Andrew Scherpbier
Fri, 13 Feb 1998 09:40:28 -0800

Erik Campo wrote:
> Hi.
> I would like to know how the star thing works. Why do one result
> with more stars than another matches better the words I requested?
> I just can't seem to find out how this works.
> Thanks.

Here is the algorithm used by htdig and htsearch.

Every document is parsed into individual words. Each word has a context that
is defined by the surrounding HTML. For example, words that are within
<h1>...</h1> have a different context than words in the document title.
Each context has a weight associated with it so that some contexts are more
important than others. (Look at the configuration attributes that have
"factor" in their names, like 'title_factor' and 'heading_factor_4'.)
In addition to the context of a word, the location of the word within the
document is used to assign significance to the word; words that appear at the
beginning of a document are given more importance than words at the end.
Lastly, the number of times a word occurs within the same document is also
taken into account.
All these things combined will give a particular word in a document a combined
weight that is stored in the word database.
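
To make the indexing step concrete, here is a rough sketch in Python of how
the three ingredients (context, location, and number of occurrences) could be
combined into one weight per word. This is not the actual htdig code: the
factor values, the linear location falloff, and the names CONTEXT_FACTORS,
location_factor, and word_weights are all assumptions made up for illustration.

```python
# Hypothetical context weights. The keys echo htdig's "*_factor" attributes,
# but the numbers are invented for this example.
CONTEXT_FACTORS = {"title": 100.0, "heading": 5.0, "text": 1.0}


def location_factor(position, doc_length):
    """Assumed linear falloff: 1.0 for the first word, approaching 0.0 at the end."""
    return 1.0 - position / doc_length


def word_weights(occurrences, doc_length):
    """Combine context, location, and repetition into one weight per word.

    occurrences: iterable of (word, context, position) tuples for one document.
    Returns a dict mapping each word to its combined weight.
    """
    weights = {}
    for word, context, position in occurrences:
        w = CONTEXT_FACTORS[context] * location_factor(position, doc_length)
        # Each additional occurrence adds to the word's weight in this document.
        weights[word] = weights.get(word, 0.0) + w
    return weights


# A ten-word document: "htdig" in the title, "search" twice in the body text.
doc = [("htdig", "title", 0), ("search", "text", 5), ("search", "text", 9)]
weights = word_weights(doc, doc_length=10)
```

In this sketch the single title word ends up far heavier than the two body
occurrences, which is the kind of effect the factor attributes are meant to
control.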

The task of htsearch is to find documents that are relevant to the search
words. Although the actual algorithm is fairly complicated because of the
boolean expression parsing and fuzzy searching, the algorithm basically goes
something like this:
Each of the words is looked up and a list of documents that they occur in is
generated. Each document is now assigned a weight that is computed from the
combined weight of all the words that got it into the result list. Once all
documents have been identified, they are now sorted by weight. The document
with the highest weight is assigned the maximum number of stars and the number
of stars for all other documents is scaled down from there.
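
The search side can be sketched the same way. The Python below is only an
illustration under simplifying assumptions: it treats the query as a plain
list of words (no boolean expressions, no fuzzy matching), and the function
name rank, the max_stars default of 4, and the round-up-with-a-minimum-of-one
scaling rule are my own choices here, not necessarily what htsearch does.

```python
import math


def rank(query_words, word_db, max_stars=4):
    """Score documents by summed word weights and scale the scores to stars.

    word_db: dict mapping word -> {document: weight}, as built at index time.
    Returns a list of (document, score, stars) tuples, best match first.
    """
    scores = {}
    for word in query_words:
        # Every document containing the word gets that word's weight added.
        for doc, weight in word_db.get(word, {}).items():
            scores[doc] = scores.get(doc, 0.0) + weight
    if not scores:
        return []

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top = ranked[0][1]
    results = []
    for doc, score in ranked:
        if top > 0:
            # The best document gets max_stars; the rest scale down from it.
            stars = max(1, math.ceil(max_stars * score / top))
        else:
            stars = 1
        results.append((doc, score, stars))
    return results


word_db = {
    "star": {"a.html": 8.0, "b.html": 2.0},
    "rank": {"a.html": 2.0, "c.html": 5.0},
}
results = rank(["star", "rank"], word_db)
```

Note that the stars are relative to the best hit of that particular search, so
the same document can show a different number of stars for different queries.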

I hope this answers your question.

Andrew Scherpbier
Contigo Software
