Re: [htdig] Htmerge: "Deleted, invalid"


Subject: Re: [htdig] Htmerge: "Deleted, invalid"
From: D.J.Adams@soton.ac.uk
Date: Fri Jul 14 2000 - 04:11:07 PDT


Sorry for the length of this!

>
> According to David Adams:
> > Why does htmerge 3.1.5 flag some pages, which look OK to me, as
> > "Deleted, invalid" and not index them?
> >
> > This is happening not just with .html pages but also .doc and .pdf files.
> >
> > It happens with a simple merge following a run of htdig -i -a
> > and also when two htdig runs are merged using the htmerge -m option.
>
> htmerge does this when the remove_bad_urls attribute is true, and the
> page in question is not found (404 error), the server name no longer
> exists, the server is down, or in the case of an update dig, the page
> has been updated, superseding the old document database record for it.
> In the latter case, htdig creates a new record for the updated document,
> with a new DocID, so the old one is discarded. As this only happens in
> update digs, it wouldn't be the case during an htdig -i, so I'd look at
> the other possibilities.
>
> In any case, run both htdig and htmerge with at least two verbose options,
> and cross-reference the DocID of the "Deleted, invalid" messages to other
> messages with the same ID, to get a clearer picture of what's happening.
>
> --
> Gilles R. Detillieux E-mail: <grdetil@scrc.umanitoba.ca>
> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
>
>

I've run htdig -vv followed by htmerge -vvv and I still cannot see
any reason why htmerge decides, apparently arbitrarily, that a page is
invalid. None of the reasons given above seem to fit.

I'll take a single example: http://www.tregalic.co.uk/sacred-heart/, which
is one of many URLs in the limit_urls_to directive.

Htdig finds http://www.tregalic.co.uk/sacred-heart/ and then
        http://www.tregalic.co.uk/sacred-heart/churchpage1.html
        http://www.tregalic.co.uk/sacred-heart/churchpage2.html
                  ...
        http://www.tregalic.co.uk/sacred-heart/churchpage7.html
amongst others.

Grepping for "churchpage" in the htmerge log I find:

htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage1.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage2.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage3.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage4.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage5.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage6.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage7.html
1897/http://www.tregalic.co.uk/sacred-heart/churchpage1.html
1898/http://www.tregalic.co.uk/sacred-heart/churchpage2.html
1899/http://www.tregalic.co.uk/sacred-heart/churchpage3.html
Deleted, invalid: 1900/http://www.tregalic.co.uk/sacred-heart/churchpage4.html
Deleted, invalid: 1901/http://www.tregalic.co.uk/sacred-heart/churchpage5.html
1902/http://www.tregalic.co.uk/sacred-heart/churchpage6.html
1903/http://www.tregalic.co.uk/sacred-heart/churchpage7.html
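
For anyone wanting to do the same cross-referencing on a larger log, here
is a minimal Python sketch. It assumes the htmerge -vvv output looks
exactly like the excerpt above (a "DocID/URL" line, optionally prefixed
with "Deleted, invalid: "); the function name and the sample data are
mine, not part of ht://Dig.

```python
import re

# Matches "1899/http://..." and "Deleted, invalid: 1900/http://...".
# The line format is assumed from the htmerge -vvv excerpt above.
LINE_RE = re.compile(r"^(?P<deleted>Deleted, invalid: )?(?P<docid>\d+)/(?P<url>\S+)$")


def parse_htmerge_log(lines):
    """Return {url: (docid, deleted)} for every DocID/URL line found."""
    records = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            records[m.group("url")] = (
                int(m.group("docid")),
                m.group("deleted") is not None,
            )
    return records


# Sample lines copied from the log excerpt above.
sample = """\
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage4.html
1899/http://www.tregalic.co.uk/sacred-heart/churchpage3.html
Deleted, invalid: 1900/http://www.tregalic.co.uk/sacred-heart/churchpage4.html
""".splitlines()

records = parse_htmerge_log(sample)
for url, (docid, deleted) in records.items():
    print(docid, "DELETED" if deleted else "kept", url)
```

To run it over a real log, the sample could be replaced with the lines of
a saved log file (for instance one called htmerge.log, a hypothetical
name). Lines such as "htmerge: Merged URL: ..." are skipped because they
do not start with a DocID.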

So I try an experiment: I reduce limit_urls_to to include only the starting
URL and http://www.tregalic.co.uk/sacred-heart/ and run htdig and htmerge again.

Then htmerge reports:

htmerge: Total word count: 3806
0/http://www.soton.ac.uk/services/local/alpha.html
1/http://www.tregalic.co.uk/sacred-heart/
9/http://www.tregalic.co.uk/sacred-heart/baptism.html
2/http://www.tregalic.co.uk/sacred-heart/churchpage1.html
3/http://www.tregalic.co.uk/sacred-heart/churchpage2.html
4/http://www.tregalic.co.uk/sacred-heart/churchpage3.html
5/http://www.tregalic.co.uk/sacred-heart/churchpage4.html
6/http://www.tregalic.co.uk/sacred-heart/churchpage5.html
7/http://www.tregalic.co.uk/sacred-heart/churchpage6.html
8/http://www.tregalic.co.uk/sacred-heart/churchpage7.html
htmerge: 10
12/http://www.tregalic.co.uk/sacred-heart/information.html
11/http://www.tregalic.co.uk/sacred-heart/links.html
10/http://www.tregalic.co.uk/sacred-heart/newsletter.html

I do not accept that pages 4 & 5 just happened to be unavailable on the
first occasion and available on the second. Nor can I see any
differences in the htdig logs for these pages. The same sizes are
reported in both cases.
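
The comparison between the two runs can be sketched in Python as follows.
This is only an illustration under the assumption that both htmerge logs
use the "DocID/URL" and "Deleted, invalid: DocID/URL" line formats quoted
earlier; the helper names are hypothetical and the sample lines are taken
from the two excerpts.

```python
import re

# Line format assumed from the htmerge output quoted in this message.
LINE_RE = re.compile(r"^(Deleted, invalid: )?(\d+)/(\S+)$")


def deleted_and_kept(lines):
    """Split the URLs in one htmerge log into (deleted, kept) sets."""
    deleted, kept = set(), set()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a DocID/URL line, e.g. "htmerge: Merged URL: ..."
        (deleted if m.group(1) else kept).add(m.group(3))
    return deleted, kept


def inconsistent_urls(run_a, run_b):
    """URLs flagged 'Deleted, invalid' in one run but kept in the other."""
    del_a, kept_a = deleted_and_kept(run_a)
    del_b, kept_b = deleted_and_kept(run_b)
    return sorted((del_a & kept_b) | (del_b & kept_a))


base = "http://www.tregalic.co.uk/sacred-heart/"
first_run = [
    "1899/" + base + "churchpage3.html",
    "Deleted, invalid: 1900/" + base + "churchpage4.html",
    "Deleted, invalid: 1901/" + base + "churchpage5.html",
    "1902/" + base + "churchpage6.html",
]
second_run = [
    "4/" + base + "churchpage3.html",
    "5/" + base + "churchpage4.html",
    "6/" + base + "churchpage5.html",
    "7/" + base + "churchpage6.html",
]

for url in inconsistent_urls(first_run, second_run):
    print("inconsistent:", url)
```

With the sample lines above, only churchpage4.html and churchpage5.html
are reported, which matches what I see by eye.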

I think there is a bug in htmerge 3.1.5 which causes it to declare
some pages as "invalid" in some cases.

-- 
 
David J Adams
<D.J.Adams@soton.ac.uk>
Computing Services
University of Southampton



