htdig: info about the theory


Hubert Razack (razack@mygale.org)
Thu, 02 Jul 1998 10:43:15 +0200


Hi, and thanks to those who answered my previous mail (actually, the answer
was pretty obvious, but after a long work day, you know how it is...).

Now I'm interested in the theory behind htdig. I mean the way it computes
the weight for each word, and how it computes the score for each document.

From the source code, I've seen that the weight is computed this way:
(for each occurrence) weight = weight + (1000 - location) * weight_factor
where location is normalized, and is therefore between 0 and 1000.
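
To make the formula concrete, here is a minimal sketch in Python. It only
illustrates the formula as I read it and is not htdig's actual code; the
function name word_weight and the default weight_factor of 1.0 are
assumptions made for the example.

```python
def word_weight(locations, weight_factor=1.0):
    """Accumulate one word's weight over all its occurrences in a document.

    locations     -- normalized positions, 0 (start of document) to 1000 (end)
    weight_factor -- multiplier applied to every occurrence (assumed value)
    """
    weight = 0.0
    for location in locations:
        if not 0 <= location <= 1000:
            raise ValueError("location must be normalized to the range 0..1000")
        # Earlier occurrences contribute more; one at the very end adds 0.
        weight += (1000 - location) * weight_factor
    return weight


# Two occurrences near the start: 1000 + 900
early = word_weight([0, 100])
# Two occurrences near the end: 100 + 0
late = word_weight([900, 1000])
```

Note that a word occurring only at location 1000 ends up with weight 0 under
this formula, even though it is present in the document.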

But how is the score of a complete document computed? Obviously, it's not
a vector space model (with the score being the scalar product of the
document vector and the query vector), because even if a word has weight 0,
the document is still retrieved.
Is there any documentation about this somewhere? (Since the source code is
available, I suppose the theory must be written down somewhere.)
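
For comparison, this is what I mean by the vector space model, as a minimal
Python sketch. It is the textbook scalar product, not htdig's scoring (which
is exactly what I am asking about); the function name dot_score and the
sample weights are made up for the example.

```python
def dot_score(doc_weights, query_weights):
    """Scalar product of a document vector and a query vector.

    Both vectors are dicts mapping a term to its weight; a term missing
    from the query counts as weight 0.
    """
    return sum(
        weight * query_weights.get(term, 0.0)
        for term, weight in doc_weights.items()
    )


# Hypothetical document: "engine" is present but carries weight 0.
doc = {"search": 1900.0, "engine": 0.0}

score_engine = dot_score(doc, {"engine": 1.0})
score_search = dot_score(doc, {"search": 1.0})
```

With a pure scalar product, the query "engine" gives this document a score
of 0, so it would normally not be returned. htdig does return such
documents, which is why I think it uses something else.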

And one last question: how far is htdig from the "big" search engines
(altavista, infoseek, ...)? Is it just a question of computing power, or are
the retrieval algorithms completely different?

Thanks,

        - Hubert -
razack@mygale.org
http://www.mygale.org/07/razack


