Hubert Razack
Thu, 02 Jul 1998

Hi, and thanks to those who answered my previous mail (actually, that was
pretty obvious, but well, after a long work day, you know how it is ...).

Now I'm interested in the theory behind htdig. I mean the way it computes
the weight for each word, and how it computes the score for each document.

From the source code, I can see that the weight is computed this way
(for each occurrence):
    weight = weight + (1000 - location) * weight_factor
with location being normalized, hence between 0 and 1000.
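
As a minimal sketch of the formula above, written in Python rather than
htdig's own language: the function and parameter names here are
hypothetical and not taken from the htdig source, and a single
weight_factor is assumed for all occurrences.

```python
def word_weight(locations, weight_factor=1.0):
    """Accumulate a word's weight over all of its occurrences.

    `locations` holds the normalized position (0..1000) of each
    occurrence of the word in one document. Names are illustrative
    only; this is a reading of the formula, not htdig's code.
    """
    weight = 0.0
    for location in locations:
        if not 0 <= location <= 1000:
            raise ValueError("location must be normalized to 0..1000")
        # Occurrences near the start of the document contribute more.
        weight += (1000 - location) * weight_factor
    return weight


if __name__ == "__main__":
    # Three occurrences: start, middle, and end of the document.
    print(word_weight([0, 500, 1000], weight_factor=2))
```

Under this reading, an occurrence at the very end of a document
(location 1000) adds nothing, so a word can be indexed and still carry
weight 0.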

But how is the score of a complete document computed? Obviously, it's not
a vector space model (with the score being the scalar product of the
document vector and the query vector), because even if a word has weight 0,
the document is still retrieved.
Is there any documentation about it somewhere? (Since the source code is
available, I suppose the theory must be somewhere.)
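
To make the vector space comparison concrete, here is a minimal Python
sketch of scalar-product scoring. It is a generic illustration, not
htdig's scoring code; the dictionary representation and all names are
assumptions.

```python
def dot_score(doc_weights, query_weights):
    """Scalar product of a document vector and a query vector.

    Both vectors are dicts mapping a term to its weight; terms missing
    from the document count as weight 0. Illustrative names only.
    """
    return sum(
        q_weight * doc_weights.get(term, 0.0)
        for term, q_weight in query_weights.items()
    )


if __name__ == "__main__":
    doc = {"htdig": 0.0, "search": 3.0}
    # The only query term has weight 0 in the document, so the score is 0.
    print(dot_score(doc, {"htdig": 1.0}))
```

With this kind of scoring, a document whose only matching word has
weight 0 scores 0 and would normally not be returned, which is the
behaviour that differs from what htdig appears to do.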

And a last question: how far is htdig from the "big" search engines
(altavista, infoseek, ...)? Is it just a question of power, or are the
retrieval algorithms completely different?


        - Hubert -
