Re: [htdig] htmerge: discarding

Subject: Re: [htdig] htmerge: discarding
From: Geoff Hutchison
Date: Thu Jun 01 2000 - 09:43:00 PDT

On Thu, 1 Jun 2000, mikeg wrote:

> htmerge: Discarding thomas in doc #9987
> htmerge: Discarding thomson in doc #7046
> htmerge: Discarding thorns in doc #1131
> What is this doing exactly. I have never noticed this up to now, although
> it's usually automated merges without any verbosity turned on...

If you take a careful look, you'll notice that these doc #s are the same
as those being discarded in the document step.

There are two main possibilities why a document is "thrown out."
1) It has been superseded: You're running an update and the document has
been modified. So it tosses the previous version (and all the associated
words) and indexes it again.
2) It is empty: Before you ask why there are words associated with empty
(or nonexistent) documents, remember that htdig also indexes the text of
links to a document. So it must get rid of this text so that you won't get
results pointing to nonexistent documents.
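
To make the two cases concrete, here is a rough Python sketch of the idea.
It is not htmerge's actual code: the DocRecord class, its superseded and
empty flags, and both function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DocRecord:
    """Hypothetical stand-in for a document database entry."""
    doc_id: int
    superseded: bool = False  # an update re-indexed a modified copy
    empty: bool = False       # known only through link text, no body indexed


def discarded_ids(docs):
    """Return the IDs of documents whose words should be thrown out."""
    return {d.doc_id for d in docs if d.superseded or d.empty}


def surviving_words(word_entries, discarded):
    """Keep only (word, doc_id) pairs whose document is not discarded."""
    return [(word, doc_id) for word, doc_id in word_entries
            if doc_id not in discarded]
```

Any word entry whose doc # lands in the discarded set is dropped, which is
why the same doc #s show up in both the document step and the word step.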

> My only debugging note is that I did run a partial dig, then started over
> without clearing the .work DBs, but if anything I figured that htmerge
> would notice my doubling and fix it.

You don't define what you mean by "partial dig." If you killed off
htdig before it finished, this really isn't a good idea. There's some
housekeeping it does at the end of a run and this won't get performed if
the process is killed.

> Oh yeah, and I also notice duplicates, such as:
> htmerge: Discarding zhane in doc #2872
> htmerge: Discarding zhane in doc #2872

When link text is indexed, words for one document get added from the
different documents that link to it. So in 3.1.x, htmerge needs to sort the
wordlist and merge these duplicate words into one entry. If the document is
to be tossed, it doesn't bother merging them; it just tosses the duplicates.
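
A rough Python sketch of that sort-and-merge pass, for illustration only.
The (word, doc_id, count) tuple layout, the count field, and the
merge_wordlist name are assumptions, not the real 3.1.x wordlist format; the
log lines just imitate the messages quoted above.

```python
from itertools import groupby


def merge_wordlist(entries, discarded):
    """Sort (word, doc_id, count) entries and merge duplicates.

    Entries for a discarded document are not merged; each one is
    dropped and logged individually, so duplicates are logged twice.
    """
    merged = []
    log = []
    ordered = sorted(entries)  # sorting makes duplicate entries adjacent
    for (word, doc_id), group in groupby(ordered, key=lambda e: (e[0], e[1])):
        group = list(group)
        if doc_id in discarded:
            for _ in group:
                log.append(f"Discarding {word} in doc #{doc_id}")
            continue
        # Surviving duplicates collapse into one entry with summed counts.
        merged.append((word, doc_id, sum(e[2] for e in group)))
    return merged, log
```

With two entries for "zhane" in a discarded doc #2872, this logs the
"Discarding" line twice, matching the repeated output you saw.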

-Geoff Hutchison
Williams Students Online


This archive was generated by hypermail 2b28 : Thu Jun 01 2000 - 07:32:43 PDT