Re: htdig: excluding anything inside the body tags

Ben Pitzer
Thu, 06 Aug 1998 10:26:07 -0400

At 04:52 AM 8/6/98 -0400, Geoff Hutchison wrote:
>> With setting the title_factor to 10 and the text_factor
>> as well as all heading_factors to 0 we still get things that
>> are between the body tags such as links to other pages
>Well the purpose of text_factor is:
> This is a factor which will be used to multiply the
> weight of words that are not in any special part of a
> document. Setting a factor to 0 will cause normal words
> to be ignored.

It occurs to me that the phrase 'not in any special part of a
document' is a tad ambiguous. Would the body be considered a 'special
part' of the document? How about links? One could argue that anything
enclosed in specific tags is in a special part of the document, and is
therefore not excluded by 'text_factor: 0'. According to the
documentation I've seen so far, the only tags that htdig looks for
specifically are <title>...</title> and <h1>...</h1> through
<h6>...</h6>. Does everything outside those tags count as 'not in any
special part of the document'?
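
For reference, the setup described in the quoted message would look
roughly like the fragment below in htdig.conf. This is only a sketch:
the quoted text says "all heading_factors" without listing them, so the
per-level names heading_factor_1 through heading_factor_6 are an
assumption and should be checked against the attribute reference for
the installed version.

```
# htdig.conf fragment (sketch, not a tested configuration)
# Weight words found in <title>...</title> heavily.
title_factor:      10

# Zero weight for words "not in any special part of a document".
text_factor:       0

# Zero weight for heading text, <h1> through <h6>.
# ASSUMPTION: per-level attribute names; verify for your htdig version.
heading_factor_1:  0
heading_factor_2:  0
heading_factor_3:  0
heading_factor_4:  0
heading_factor_5:  0
heading_factor_6:  0
```

Whether link text inside the body falls under text_factor with these
settings is exactly the open question above; the fragment does not
answer it.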

>An alternative solution is to use META description tags and the patch I
>produced. No body text will appear in the output.

Unfortunately, we're trying to adjust searches on a large, extensive web
for which the installation of META tags is just not feasible. Thanks for
the idea, though.


Benjamin J. Pitzer

"I would rather be ashes than dust. I would rather that my spark should
burn out in a brilliant blaze than be stifled by dry-rot. I would rather
be a superb meteorite, every atom of me in magnificent glow, than a sleepy,
permanent planet. The proper function of man is to live, not to exist. I
will not waste my days in trying to prolong them. I will use my time."

- Jack London

