Re: [htdig] Htmerge: "Deleted, invalid"


Subject: Re: [htdig] Htmerge: "Deleted, invalid"
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Jul 12 2000 - 09:08:26 PDT


According to David Adams:
> Why does htmerge 3.1.5 flag some pages, which look OK to me, as
> "Deleted, invalid" and not index them?
>
> This is happening not just with .html pages but also .doc and .pdf files.
>
> It happens with a simple merge following a run of htdig -i -a
> and also when two htdig runs are merged using the htdig -m option.

htmerge does this when the remove_bad_urls attribute is true and any one
of the following applies: the page in question is not found (404 error),
the server name no longer exists, the server is down, or, in the case of
an update dig, the page has been updated, superseding the old document
database record for it. In that last case, htdig creates a new record for
the updated document, with a new DocID, so the old one is discarded. As
this only happens in update digs, it wouldn't be the case during an
htdig -i, so I'd look at the other possibilities.

In any case, run both htdig and htmerge with at least two verbose options
(-v -v), and cross-reference the DocID in each "Deleted, invalid" message
against other messages with the same ID, to get a clearer picture of
what's happening.
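
If the verbose output is long, the cross-referencing can be scripted. What
follows is only a rough sketch in Python, not part of ht://Dig. It assumes
you have saved the htdig and htmerge output to two text files, and that the
DocID is the first number after the "Deleted, invalid" text on a line. I
have not checked that against the exact log format, so adjust the pattern
to match what your logs show. The function names are made up for this
example.

```python
import re
import sys
from collections import defaultdict

MARKER = "Deleted, invalid"


def deleted_ids(merge_lines):
    """Return the DocIDs named on "Deleted, invalid" lines, in order.

    Assumption: the DocID is the first integer after the marker text.
    """
    ids = []
    for line in merge_lines:
        pos = line.find(MARKER)
        if pos == -1:
            continue
        match = re.search(r"\d+", line[pos + len(MARKER):])
        if match:
            ids.append(int(match.group()))
    return ids


def cross_reference(merge_lines, dig_lines):
    """Map each deleted DocID to the other log lines that mention it."""
    merge_lines = list(merge_lines)
    wanted = set(deleted_ids(merge_lines))
    hits = defaultdict(list)
    for line in list(dig_lines) + merge_lines:
        if MARKER in line:
            continue  # skip the "Deleted, invalid" lines themselves
        # Standalone numbers only, so digits inside words are ignored.
        numbers = {int(tok) for tok in re.findall(r"\b\d+\b", line)}
        for doc_id in numbers & wanted:
            hits[doc_id].append(line.rstrip("\n"))
    for doc_id in wanted:
        hits.setdefault(doc_id, [])  # keep IDs with no other mention
    return dict(hits)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python crossref.py htdig.log htmerge.log
    with open(sys.argv[1]) as dig, open(sys.argv[2]) as merge:
        report = cross_reference(merge.readlines(), dig.readlines())
    for doc_id, lines in sorted(report.items()):
        print(f"DocID {doc_id}:")
        for text in lines:
            print(f"    {text}")
```

Any number that happens to equal a DocID will match, so treat the report
as a list of places to look rather than a diagnosis.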

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
