Re: [htdig] Htmerge: "Deleted, invalid"

Subject: Re: [htdig] Htmerge: "Deleted, invalid"
Date: Fri Jul 14 2000 - 04:11:07 PDT

Sorry for the length of this!

> According to David Adams:
> > Why does htmerge 3.1.5 flag some pages, which look OK to me, as
> > "Deleted, invalid" and not index them?
> >
> > This is happening not just with .html pages but also .doc and .pdf files.
> >
> > It happens with a simple merge following a run of htdig -i -a
> > and also when two htdig runs are merged using the htdig -m option.
> htmerge does this when the remove_bad_urls attribute is true, and the
> page in question is not found (404 error), the server name no longer
> exists, the server is down, or in the case of an update dig, the page
> has been updated, superseding the old document database record for it.
> In the latter case, htdig creates a new record for the updated document,
> with a new DocID, so the old one is discarded. As this only happens in
> update digs, it wouldn't be the case during an htdig -i, so I'd look at
> the other possibilities.
> In any case, run both htdig and htmerge with at least two verbose options,
> and cross-reference the DocID of the "Deleted, invalid" messages to other
> messages with the same ID, to get a clearer picture of what's happening.
> --
> Gilles R. Detillieux E-mail: <>
> Spinal Cord Research Centre WWW:
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930

I've run htdig -vv followed by htmerge -vvv and I still cannot see
any reason why htmerge decides, apparently arbitrarily, that a page is
invalid. None of the reasons given above seem to fit.

I'll take a single example:, is
one of many in the limit_urls_to directive.

Htdig finds and then
amongst others.

Grepping for "churchpage" in the htmerge log I find:

htmerge: Merged URL:
htmerge: Merged URL:
htmerge: Merged URL:
htmerge: Merged URL:
htmerge: Merged URL:
htmerge: Merged URL:
htmerge: Merged URL:
Deleted, invalid: 1900/
Deleted, invalid: 1901/
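
For anyone wanting to repeat the cross-referencing Gilles suggests, here is a
rough Python sketch. It assumes only what the excerpt above shows: that a
deletion is logged as "Deleted, invalid: <DocID>/<URL>". How htdig and htmerge
print DocIDs elsewhere in their verbose output is not something I can vouch
for, so the sketch simply collects every other line that contains the same
number as a whole token. The function names are my own and not part of ht://Dig.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the log excerpt: "Deleted, invalid: <DocID>/<URL>"
DELETED_RE = re.compile(r"^Deleted, invalid:\s*(\d+)/(.*)$")


def deleted_doc_ids(lines):
    """Return {DocID: URL text} for every "Deleted, invalid" line."""
    deleted = {}
    for line in lines:
        match = DELETED_RE.match(line.strip())
        if match:
            deleted[int(match.group(1))] = match.group(2).strip()
    return deleted


def cross_reference(doc_ids, lines):
    """Map each DocID to the other log lines that mention it as a whole number.

    This is a plain text search, not a parser for any particular log format,
    so it can return lines where the number means something other than a DocID.
    """
    hits = defaultdict(list)
    for line in lines:
        text = line.strip()
        if DELETED_RE.match(text):
            continue  # skip the deletion messages themselves
        numbers = {int(n) for n in re.findall(r"\b\d+\b", text)}
        for doc_id in doc_ids:
            if doc_id in numbers:
                hits[doc_id].append(text)
    return dict(hits)
```

To use it, read the saved htdig and htmerge output into lists of lines, pass
the htmerge lines to deleted_doc_ids(), then pass the result and the combined
lines of both logs to cross_reference(). Expect some false matches, since any
line containing the same number is reported.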

So I try an experiment: I reduce limit_urls_to to include only the starting URL
and and run htdig & htmerge.

Then htmerge reports:

htmerge: Total word count: 3806
htmerge: 10

I do not accept that pages 4 & 5 just happened to be unavailable on the
first occasion and available on the second. Nor can I see any
differences in the htdig logs for these pages. The same sizes are
reported in both cases.

I think there is a bug in htmerge 3.1.5 which causes it to declare
some pages as "invalid" in some cases.

David J Adams
Computing Services
University of Southampton

