[htdig3-dev] Architecture Overview: Scoring


Subject: [htdig3-dev] Architecture Overview: Scoring
From: Geoff Hutchison (ghutchis@wso.williams.edu)
Date: Mon Mar 13 2000 - 19:17:30 PST


OK, this is a bit late, I had a pile of things that were due today.
Fortunately, I'm almost done with my graduate classes (Friday), which
will be quite nice.

Scoring in version 3.2:

In versions 3.1 and before, scoring was done at the time of indexing.
This made scoring during the search quite easy (it was mostly
pre-computed), but is a real hassle if you're trying to optimize the
default scoring factors. Since the defaults are by no means the best
possible values for all people, this essentially prevents
experimentation.

As outlined in previous overviews, the words themselves in 3.2 are
stored with a set of "flags" representing the context. Each flag is
associated with a weighting factor, and currently htsearch loops
through and sums up the factors for each matching word in a document.
Note that unlike versions before 3.2, the position in the document
doesn't play a part in scoring. (Previous versions scaled the
character position from 1-1000 and gave a factor of 1000 to appearing
in the beginning and decreasing down to a factor of 1 to appearing at
the end.)
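
To make the flag summation concrete, here is a rough sketch in Python
(not the actual htsearch code, which is C++). The flag names and
factor values are made up for illustration and are not ht://Dig's real
flags or defaults.

```python
# Hypothetical context flags and per-flag factors; names and values are
# illustrative only, not ht://Dig's actual flags or default settings.
FACTORS = {"text": 1, "title": 100, "heading": 5, "keywords": 100}

def word_weight(occurrences):
    """Sum the factor of every flag set on every occurrence of a word.

    `occurrences` holds one set of flag names per occurrence.
    Position in the document is deliberately ignored.
    """
    return sum(FACTORS[flag] for flags in occurrences for flag in flags)

# Three occurrences of one word: once in the title, twice in plain text.
occurrences = [{"title"}, {"text"}, {"text"}]
weight = word_weight(occurrences)
print(weight)
```

Because the factors are applied at search time, changing a value in
the table changes the ranking without re-indexing.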

So let's run through the scoring for two words, foo and foobar. Let's
say for the sake of argument that foobar was generated by a fuzzy
algorithm and has a search_algorithm weighting of 0.5.

Now in document A, "foo" occurs 10 times, with total weight 350 and
"foobar" occurs 5 times, with total weight 200 (e.g. they all appear
as headers). Let's also say in the total database, "foo" occurs 250
times and "foobar" occurs 100 times.

Without referring to a formula, we know that we have to balance the
number of occurrences in the document against how common the word is.
Currently, it's difficult to work out the number of occurrences in a
document. However, it's easy to work out the total number of
occurrences of a word in the whole database.

So for document A, the score from the words goes roughly like this:
Sum(Fuzzy_Factor * Word_Weight / Total_Word_Frequency)

word_score = 1*350/250 + 0.5*200/100 = 1.4 + 1 = 2.4
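
The same arithmetic as a small Python sketch, using the numbers for
document A above. The function name and the tuple layout are
assumptions for the example, not anything in the htsearch source.

```python
def word_score(matches):
    """Sum(Fuzzy_Factor * Word_Weight / Total_Word_Frequency).

    `matches` is a list of (fuzzy_factor, word_weight, total_frequency)
    tuples, one per matching word in the document.
    """
    return sum(fuzzy * weight / total for fuzzy, weight, total in matches)

# Document A: "foo" is an exact match (factor 1), "foobar" came from a
# fuzzy algorithm with a weighting of 0.5.
doc_a = [
    (1.0, 350, 250),  # foo: weight 350, 250 occurrences in the database
    (0.5, 200, 100),  # foobar: weight 200, 100 occurrences in the database
]
score = word_score(doc_a)
print(round(score, 2))
```

Note that the 10 and 5 occurrences within the document never appear
directly; they only show up through the summed weights.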

Currently, there are two non-word factors: backlink_factor and
date_factor. Another reasonable one would be hopcount_factor, and of
course Hans-Peter's url_seed_score modifications would fit in here as
well. These simply add to the document weighting based on other
attributes of the document.

In the current code, before reporting the score (and sorting),
htsearch takes the natural log of this value. Why? This is an attempt
to even things out a bit: each additional point of score requires
multiplying the weight by a constant factor (e, roughly 2.7). This
doesn't entirely balance out the extra weight given to long documents,
but it helps.
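
A quick Python sketch of the effect of the log, assuming the reported
score is simply the natural log of the summed weight (the non-word
factors are left out here, and the function name is made up for the
example):

```python
import math

def reported_score(weight):
    """Natural log of the accumulated document weight."""
    return math.log(weight)

# Ten times the weight of document A's word score of 2.4...
gain = reported_score(24.0) - reported_score(2.4)
# ...adds only ln(10) to the reported score, whatever the starting weight.
print(round(gain, 3))
```

So a document with ten times the raw weight only moves up by about
2.3 points, which is what damps the advantage of long documents.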

There are, of course, many variations on this theme. Almost any IR
book will describe a few variants. However, this improves on previous
scoring mechanisms by taking total word frequency into account and
attempting to balance out long documents. Testing would be helpful to
see if it actually works!

-Geoff



