Re: htdig: Patch for bug: non-word (non-punctuation) characters included from descriptions


Geoff Hutchison (ghutchis@wso.williams.edu)
Wed, 6 Jan 1999 00:29:35 -0500


>Non-word characters are painstakingly removed from words in the
>text proper (see HTML.cc), but kept in descriptions for obvious
>reasons. However, when the words from a description are
>weighted and added to the list of words to index, these
>characters go in for free (and I don't mean the
>"valid_punctuation" characters).

Aha. To think all along I was seeing the same bug but not realizing what
was going on.
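
To make the problem concrete, here is a rough Python sketch of the idea (not the actual HTML.cc code, and not the patch itself). The function name and the VALID_PUNCTUATION value are made up for illustration; the real valid_punctuation setting comes from your configuration. It splits a description into words on anything that is neither alphanumeric nor in that set, which is roughly what should happen before description words reach the index:

```python
# Hypothetical stand-in for the valid_punctuation setting (NOT the real default).
VALID_PUNCTUATION = "-_."


def description_words(description, valid_punctuation=VALID_PUNCTUATION):
    """Split a link description into index words.

    Any character that is neither alphanumeric nor listed in
    valid_punctuation ends the current word and is dropped.
    """
    words = []
    current = []

    def flush():
        # Keep the word only if it contains at least one alphanumeric
        # character, so runs of bare punctuation are not indexed.
        if any(ch.isalnum() for ch in current):
            words.append("".join(current).lower())
        current.clear()

    for ch in description:
        if ch.isalnum() or ch in valid_punctuation:
            current.append(ch)
        else:
            flush()
    flush()
    return words
```

The description itself stays untouched for display; only the words handed to the indexer go through this filter.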

>(BTW, the words are counted twice; once for being in the text,
>and once for being in a description. I believe it's ok, it's
>supposed to be like that.)

They're counted "twice", but in different ways. This was one of the changes
in 3.1.0b3, and it's a big one, so I'll explain it. The words count towards
the document they're in, weighted by text_factor. So they count, but not
much by default.

They also count towards the document the anchor tag points to. This time
they're weighted by description_factor, so by default they count more
towards the linked document than towards the document that contains the HTML.

Why? It's a little complicated, but it turns out that these "descriptions"
are usually good at summarizing the contents of that URL. So this is part
of the ranking improvements in 3.1.0b3 and later.
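
If it helps, the double counting can be sketched in a few lines of Python. This is only an illustration of the weighting described above, not the indexer's code: the function name is invented, and the two factor values are placeholders, so check your own configuration for the real text_factor and description_factor.

```python
# Placeholder weights for illustration only (assumed, not the shipped defaults).
TEXT_FACTOR = 1
DESCRIPTION_FACTOR = 150


def score_link_text(containing_url, target_url, words,
                    text_factor=TEXT_FACTOR,
                    description_factor=DESCRIPTION_FACTOR):
    """Return {url: {word: weight}} for the words inside one anchor tag.

    Each word counts once for the page that contains the link (as ordinary
    text) and once for the page the link points to (as a description).
    """
    scores = {containing_url: {}, target_url: {}}
    for word in words:
        containing = scores[containing_url]
        containing[word] = containing.get(word, 0) + text_factor
        target = scores[target_url]
        target[word] = target.get(word, 0) + description_factor
    return scores
```

With these placeholder weights, a word in the link text adds 1 to the containing page and 150 to the linked page, which is the "counts more towards the linked document" behaviour.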

>Here's a patch to handle characters in description-words just
>like for other words:

Cool. As usual, it's in the CVS tree. At 5AM, do you get sleep?

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
