Re: [htdig] Duration of Htsearch Processing (3.1.5)

From: Geoff Hutchison
Date: Sat Mar 18 2000 - 16:55:08 PST

At 6:58 PM -0500 3/18/00, wrote:
>Looking at documentation, it does not appear that there is any option in
>either the conf file or the parameters passed to htsearch, to limit the
>number of matches which are located and sorted. If "several thousand"
>documents match the specified words, all of these have to participate in
>sorting; there's no way to limit the number which participate.

This has been requested in the past. The main difficulty is that it's
a bit of a chicken-and-egg problem. You want to cut out the documents
before scoring and sorting (preferably before even looking them up in
the document DB). But until you have a ranking, you don't know exactly
which ones to cut. After all, you don't want to cut out the
best-ranked documents!
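
To illustrate the point, here is a minimal Python sketch. It is not how
htsearch works internally; the function names and the doc-id-to-score
dictionary are hypothetical. It contrasts cutting before ranking with
selecting the top matches after every document has been scored.

```python
import heapq

def top_matches(scores, limit):
    """Return the `limit` best (doc_id, score) pairs, best first.

    Every match still has to be scored; only the full sort is avoided.
    """
    return heapq.nlargest(limit, scores.items(), key=lambda item: item[1])

def cut_then_rank(scores, limit):
    """Naive approach: keep the first `limit` matches as found, then rank them.

    This can throw away the best-ranked documents before they are scored.
    """
    first = list(scores.items())[:limit]
    return sorted(first, key=lambda item: item[1], reverse=True)

# Hypothetical scores, in the order the matches were found.
scores = {"a": 1.0, "b": 5.0, "c": 9.0}
best = top_matches(scores, 2)      # includes "c", the best match
naive = cut_then_rank(scores, 2)   # never sees "c"
```

With the naive cut, the highest-scoring document is dropped simply
because it was found last.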

>Appears to me that I could inspect the .wordlist file produced by htdig,
>locate the records which are resulting in unwanted matches, and remove these
>prior to running htmerge.

Yes, you can do this. Another good technique is to use the cut and
sort command-line programs to count how often each word occurs and
add the overused ones to the bad_words list. One reason for doing this
is that very common words add very little information to a query.
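
The same frequency count can be sketched in Python. This is only a
rough sketch: it assumes each record in the word list starts with the
word as its first whitespace-separated field, so check that against
your own file. The function names and the threshold are made up for
illustration.

```python
from collections import Counter

def count_words(lines):
    """Count how many records each word has.

    Assumption: the word is the first whitespace-separated field of
    each record; blank lines are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts

def overused(counts, threshold):
    """Return, in alphabetical order, the words with at least `threshold` records."""
    return sorted(word for word, n in counts.items() if n >= threshold)

# Hypothetical sample records; real ones come from the word list file.
sample = ["the i:1 l:2", "the i:2 l:7", "cat i:1 l:9", ""]
candidates = overused(count_words(sample), 2)
```

Review the candidates by hand before adding any of them to the
bad_words list, since a frequent word can still be a useful search
term on some sites.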

-Geoff Hutchison
Williams Students Online

