Subject: Re: [htdig] Htmerge: "Deleted, invalid"
Date: Fri Jul 14 2000 - 04:11:07 PDT
Sorry for the length of this!
> According to David Adams:
> > Why does htmerge 3.1.5 flag some pages, which look OK to me, as
> > "Deleted, invalid" and not index them?
> > This is happening not just with .html pages but also .doc and .pdf files.
> > It happens with a simple merge following a run of htdig -i -a
> > and also when two htdig runs are merged using the htmerge -m option.
> htmerge does this when the remove_bad_urls attribute is true, and the
> page in question is not found (404 error), the server name no longer
> exists, the server is down, or in the case of an update dig, the page
> has been updated, superseding the old document database record for it.
> In the latter case, htdig creates a new record for the updated document,
> with a new DocID, so the old one is discarded. As this only happens in
> update digs, it wouldn't be the case during an htdig -i, so I'd look at
> the other possibilities.
> In any case, run both htdig and htmerge with at least two verbose options,
> and cross-reference the DocID of the "Deleted, invalid" messages to other
> messages with the same ID, to get a clearer picture of what's happening.
> Gilles R. Detillieux E-mail: <email@example.com>
> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
I've run htdig -vv followed by htmerge -vvv and I still cannot see
any reason why htmerge decides, apparently arbitrarily, that a page is
invalid. None of the reasons given above seem to fit.
I'll take a single example: http://www.tregalic.co.uk/sacred-heart/ is
one of many URLs in the limit_urls_to directive.
Htdig finds http://www.tregalic.co.uk/sacred-heart/ and then
Grepping for "churchpage" in the htmerge log I find:
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage1.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage2.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage3.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage4.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage5.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage6.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage7.html
Deleted, invalid: 1900/http://www.tregalic.co.uk/sacred-heart/churchpage4.html
Deleted, invalid: 1901/http://www.tregalic.co.uk/sacred-heart/churchpage5.html
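The cross-referencing suggested above can also be done with a small script
instead of grep. What follows is only a minimal sketch in Python, assuming the
htmerge log contains lines in exactly the two shapes shown in the excerpt
("htmerge: Merged URL: <url>" and "Deleted, invalid: <DocID>/<url>"). The
function name cross_reference and the SAMPLE text are made up for illustration;
they are not part of ht://Dig.

```python
import re
from collections import defaultdict

# Line shapes copied from the log excerpt; any other htmerge output is ignored.
MERGED_RE = re.compile(r"^htmerge: Merged URL: (\S+)$")
DELETED_RE = re.compile(r"^Deleted, invalid: (\d+)/(\S+)$")


def cross_reference(lines):
    """Return {url: [DocIDs]} for URLs reported as both merged and deleted."""
    merged = set()
    deleted = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        match = MERGED_RE.match(line)
        if match:
            merged.add(match.group(1))
            continue
        match = DELETED_RE.match(line)
        if match:
            doc_id, url = match.groups()
            deleted[url].append(int(doc_id))
    # Keep only the URLs that appear in both kinds of message.
    return {url: ids for url, ids in deleted.items() if url in merged}


# Hypothetical sample built from lines in the excerpt above.
SAMPLE = """\
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage3.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage4.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage5.html
Deleted, invalid: 1900/http://www.tregalic.co.uk/sacred-heart/churchpage4.html
Deleted, invalid: 1901/http://www.tregalic.co.uk/sacred-heart/churchpage5.html
"""

if __name__ == "__main__":
    for url, ids in sorted(cross_reference(SAMPLE.splitlines()).items()):
        print(url, ids)
```

To use it on a real run, the lines of the saved htmerge -vvv output would be
passed in place of SAMPLE.splitlines(). A URL that shows up in the result was
reported as merged and as "Deleted, invalid" in the same log, which is the
pattern seen for churchpage4 and churchpage5.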
So I try an experiment: I reduce limit_urls_to to include only the starting URL
and http://www.tregalic.co.uk/sacred-heart/ and run htdig & htmerge.
Then htmerge reports:
htmerge: Total word count: 3806
I do not accept that pages 4 & 5 just happened to be unavailable on the
first occasion and available on the second. Nor can I see any
differences in the htdig logs for these pages. The same sizes are
reported in both cases.
I think there is a bug in htmerge 3.1.5 which causes it to declare
some pages as "invalid" in some cases.
-- David J Adams <D.J.Adams@soton.ac.uk> Computing Services University of Southampton