Re: [htdig] Htmerge: "Deleted, invalid"


From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Mon Jul 24 2000 - 11:52:14 PDT


According to David Adams:
> I have been using htdig (3.1.2 and then 3.1.5) on an IRIX system for about a
> year and I have been very pleased with it. I would say that we've given it a
> good workout here. The problem with the "Deleted, invalid" messages only
> occurs with a second, relatively new search index.

I guess I should have read your message before responding to Geoff's!

> The first index is made from a single run of htdig covering 33 servers, all in
> the local domain, and on this week's initial dig htmerge reports 49,233
> documents and not a single "Deleted, invalid".
>
> The second index is made from two runs of htdig covering a total of 969 (yes,
> 969!) servers using a proxy. Htmerge reports a mere 3,096 documents and 86
> "Deleted, invalid".
>
> I have looked at the db.wordlist files (which are written to only by htdig - is
> that right?)

Yes and no. htdig creates and writes the initial db.wordlist, then htmerge
sorts it, merges words together, and processes flags for page removals. It
then rewrites this file before creating the word index database.
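
To make those three steps concrete, here is a rough sketch of the idea in
Python. It is not htmerge's actual code, and the record layout is invented
for illustration: each record is a hypothetical (word, doc_id, weight)
tuple, and removed_docs is a hypothetical set of document IDs that were
flagged for removal. Summing the weights of duplicate entries is likewise
an assumption about what "merging" means here.

```python
from itertools import groupby


def merge_wordlist(records, removed_docs):
    """Sort, merge and prune a simplified word list.

    records      -- iterable of (word, doc_id, weight) tuples (hypothetical layout)
    removed_docs -- set of doc_ids flagged for removal (hypothetical)
    """
    # Drop every word entry that belongs to a document flagged for removal.
    kept = [r for r in records if r[1] not in removed_docs]

    # Sort by word, then by document, so duplicate entries become adjacent.
    kept.sort(key=lambda r: (r[0], r[1]))

    # Merge adjacent entries for the same (word, doc_id) by summing weights.
    merged = []
    for (word, doc_id), group in groupby(kept, key=lambda r: (r[0], r[1])):
        merged.append((word, doc_id, sum(weight for _, _, weight in group)))
    return merged
```

The point of the sketch is only that a document flagged for removal
contributes no words to the rewritten list, which fits with what you saw
in the file.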

> and it would appear that htdig is flagging the pages for htmerge
> to delete and is not finding any words in them.
>
> I can advance these theories:
>
> It is not a bug, but is due to the use of a proxy. (I use a proxy
> because without one, a portion of the sites on any run of htdig were
> found to be not responding or even unknown. With a proxy, htdig appears
> to have no such problems.)

Hold on there! The problem of sites being down (unknown or not
responding) is exactly the sort of thing that causes the "Deleted,
invalid" situation, and I said so last week. How did you conclude that
htdig appears to have no such problems with a proxy, when it does indeed
appear to be having exactly that problem? It would make sense that if
a site is not responding, the proxy would inform htdig of this (unless
it happened to quietly substitute a cached copy of the requested page
- assuming it had one), and htdig would respond the same way it would
without a proxy. I think this is the most likely theory.

> It is a bug due to the use of a proxy.
>
> It is a bug which only shows when compiled under IRIX.
>
> It is a bug which only occurs when there are many different servers.
>
> I intend to re-build the second index using htdig -vvv and perhaps learn
> something.

The only sure way to rule out an SGI compiler or IRIX-specific problem
would be to run htdig on a Linux box with the same configuration and
the same proxy, and see if you get the same results. However, based on
what you said about a portion of the sites not responding, I'd guess
that unresponsive sites are the more likely cause. There could also be
a problem with the proxy server itself, causing it to act as if a server
is down when it isn't, so you may want to try different proxies as well.
In any case, a close look at htdig -vvv output should give some clues.
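
If you want to check the proxy independently of htdig, something along
these lines might help. It is only a sketch, using Python's standard
urllib rather than anything from ht://Dig: it fetches each URL once
directly and once through a proxy and lists the URLs where the two
outcomes differ. The function names probe and compare, the status
strings, and the 10-second timeout are all my own choices for
illustration.

```python
import urllib.error
import urllib.request


def probe(url, proxy=None, timeout=10):
    """Fetch url directly (proxy=None) or via an HTTP proxy; return a status string."""
    # An empty mapping tells ProxyHandler to use no proxy at all.
    handler = urllib.request.ProxyHandler({"http": proxy} if proxy else {})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return "ok %d" % resp.status
    except urllib.error.HTTPError as err:
        # The server (or the proxy on its behalf) answered with an error code.
        return "http error %d" % err.code
    except (urllib.error.URLError, OSError) as err:
        # Unknown host, refused connection, timeout and similar failures.
        return "failed: %s" % err


def compare(urls, proxy, fetch=probe):
    """Return (url, direct_result, proxied_result) for URLs where the two differ."""
    report = []
    for url in urls:
        direct = fetch(url)
        proxied = fetch(url, proxy)
        if direct != proxied:
            report.append((url, direct, proxied))
    return report
```

Calling compare() with a handful of start URLs from the second index and
your proxy's address would show whether sites that fail directly come
back through the proxy as HTTP errors rather than as real pages.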

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



