Re: [htdig] Htmerge: "Deleted, invalid"


Subject: Re: [htdig] Htmerge: "Deleted, invalid"
From: D.J.Adams@soton.ac.uk
Date: Tue Jul 25 2000 - 09:10:24 PDT


>
> According to David Adams:
> > I have been using htdig (3.1.2 and then 3.1.5) on an IRIX system for about a
> > year and I have been very pleased with it. I would say that we've given it a
> > good workout here. The problem with the "Deleted, invalid" messages only
> > occurs with a second, relatively new search index.
>
> I guess I should have read your message before responding to Geoff's!
>
> > The first index is made from a single run of htdig covering 33 servers, all in
> > the local domain, and on this week's initial dig htmerge reports 49,233
> > documents and not a single "Deleted, invalid".
> >
> > The second index is made from two runs of htdig covering a total of 969 (yes,
> > 969!) servers using a proxy. Htmerge reports a mere 3,096 documents and 86
> > "Deleted, invalid".
> >
> > I have looked at the db.wordlist files (which are written to only by htdig - is
> > that right?)
>
> Yes and no. htdig creates and writes the initial db.wordlist, then htmerge
> sorts it, merges words together, and processes flags for page removals. It
> then rewrites this file before creating the word index database.
>
> > and it would appear that htdig is flagging the pages for htmerge
> > to delete and is not finding any words in them.
> >
> > I can advance these theories:
> >
> > It is not a bug, but is due to the use of a proxy. (I use a proxy
> > because without one, a portion of the sites on any run of htdig were
> > found to be not responding or even unknown. With a proxy, htdig appears
> > to have no such problems.)
>
> Hold on there! The problem of sites being down (unknown or not
> responding) is exactly the sort of thing that causes the "Deleted,
> invalid" situation, and I said so last week. How did you conclude that
> htdig appears to have no such problems with a proxy, when it does indeed
> appear to be having exactly that problem? It would make sense that if
> a site is not responding, the proxy would inform htdig of this (unless
> it happened to quietly substitute a cached copy of the requested page
> - assuming it had one), and htdig would respond the same way it would
> without a proxy. I think this is the most likely theory.

How did I conclude that htdig is having no such problems?
Two reasons:
        1). At least one page on our main server, covered by my
                http_proxy_exclude statement, is "Deleted, invalid".
        2). When I do not use http_proxy then htdig -v gives clear
                messages, such as "Unable to connect to server" and
                "Server not responding".
                With http_proxy I get no such messages, not even with htdig -vvv.

Additionally:
        3). I can access the pages using IE (same proxy) the same day,
                no problem.
        4). One or two pages from a site may be affected while others
                are not.

I have now re-run the index with htdig -i -vvv etc. I have rather a lot of
information to go through, but I've found nothing yet.

And that nothing is significant. What do you make of this? The log from htmerge
includes:

Deleted, invalid: 2200/http://www.folkmania.org.uk/LeeZachinfo.htm

While the log from htdig includes this (slightly mangled by "more" command), which looks OK to me:

pick: www.folkmania.org.uk, # servers = 246
1226:895:2:http://www.folkmania.org.uk/LeeZachinfo.htm: Retrieval command for http://www.folkmania.org.uk/LeeZachinfo.htm: GET http://www.folkmania.org.uk/LeeZachinfo.htm HTTP/1.0
User-Agent: htdig/3.1.5 (D.J.Adams@soton.ac.uk)
Referer: http://www.folkmania.org.uk/
Host: www.folkmania.org.uk

Header line: HTTP/1.0 200 OK
Header line: Server: thttpd/2.07 02dec99
Header line: Content-Type: text/html
Header line: Date: Mon, 24 Jul 2000 03:35:01 GMT
Header line: Last-Modified: Fri, 23 Jun 2000 18:34:50 GMT
Translated Fri, 23 Jun 2000 18:34:50 GMT to 2000-06-23 18:34:50 (100)
And converted to Fri, 23 Jun 2000 18:34:50
Header line: Accept-Ranges: bytes
Header line: Content-Length: 4586
Header line: Age: 127170
Header line: X-Cache: HIT from www-cacheb.soton.ac.uk
Header line: X-Cache-Lookup: HIT from www-cacheb.soton.ac.uk:3128
Header line: X-Cache: MISS from www-cachea.soton.ac.uk
Header line: X-Cache-Lookup: MISS from www-cachea.soton.ac.uk:3128
Header line: Proxy-Connection: close
Header line:
returnStatus = 0
Read 4586 from document
Read a total of 4586 bytes

title: LeeZachInfo
[snip]
 size = 4586

And that page is only retrieved once.
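
For a long -vvv log, a summary like the one above can be pulled out mechanically. The following is a minimal sketch only: it assumes the log uses exactly the "Header line: ..." format seen in the excerpt, and the function name summarize_fetch is made up for illustration. It reports the HTTP status, the Content-Length, the Age header, and whether any X-Cache header says HIT, which is one way to spot pages that were served from the proxy cache rather than fetched fresh.

```python
import re

# Patterns are based only on the excerpt above; other htdig versions
# or verbosity levels may print headers differently (assumption).
STATUS_RE = re.compile(r"^Header line: HTTP/\d\.\d (\d{3})")
HEADER_RE = re.compile(r"^Header line: ([A-Za-z-]+): (.*)$")


def summarize_fetch(log_text):
    """Summarize the response headers for one document in htdig -vvv output."""
    status = None
    headers = {}
    for line in log_text.splitlines():
        match = STATUS_RE.match(line)
        if match:
            status = int(match.group(1))
            continue
        match = HEADER_RE.match(line)
        if match:
            # A header such as X-Cache can appear more than once, so keep a list.
            name = match.group(1).lower()
            headers.setdefault(name, []).append(match.group(2).strip())

    def first_int(name):
        values = headers.get(name)
        return int(values[0]) if values else None

    return {
        "status": status,
        "content_length": first_int("content-length"),
        "age_seconds": first_int("age"),
        "cache_hit": any(v.startswith("HIT") for v in headers.get("x-cache", [])),
    }
```

Run on the excerpt above, this should give status 200, an Age of 127170 seconds (a little over 35 hours), and a cache hit, since one of the two X-Cache headers reports HIT.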

>
> > It is a bug due to the use of a proxy.
> >
> > It is a bug which only shows when compiled under IRIX.
> >
> > It is a bug which only occurs when there are many different servers.
> >

I can add another theory:

        It is a bug when merging a second index
         - all the "Deleted, invalid" pages come from the htdig run specified
           with the htmerge -m option

This theory is easy to check; I'll investigate tomorrow.
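
One way to run that check, as a rough sketch: collect the URLs from the "Deleted, invalid" lines in the htmerge log and count how many of them turn up in the log of each htdig run. It assumes the htmerge lines look like the one quoted earlier ("Deleted, invalid: <number>/<URL>") and that each run's htdig -vvv output was saved to its own file. The function names and the run labels are hypothetical.

```python
import re

# Matches lines such as:
#   Deleted, invalid: 2200/http://www.folkmania.org.uk/LeeZachinfo.htm
DELETED_RE = re.compile(r"^Deleted, invalid: \d+/(\S+)")


def deleted_urls(htmerge_log):
    """Return the URLs from every "Deleted, invalid" line in the htmerge log."""
    urls = []
    for line in htmerge_log.splitlines():
        match = DELETED_RE.match(line)
        if match:
            urls.append(match.group(1))
    return urls


def count_by_run(urls, run_logs):
    """Count, for each htdig run, how many of the URLs appear in its log.

    run_logs maps a label chosen by the caller (hypothetical, e.g. "main"
    and "merged") to the text of that run's htdig log. Matching is a plain
    substring search, so a URL that is a prefix of another URL in the log
    would also be counted.
    """
    counts = {label: 0 for label in run_logs}
    unmatched = []
    for url in urls:
        found = False
        for label, text in run_logs.items():
            if url in text:
                counts[label] += 1
                found = True
        if not found:
            unmatched.append(url)
    return counts, unmatched
```

If every deleted URL is counted under the run that was passed to htmerge -m and none under the other, that would support the theory; a mix of both would count against it.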

> > I intend to re-build the second index using htdig -vvv and perhaps learn
> > something.
>
> The only sure way to rule out an SGI compiler or IRIX-specific problem
> would be to run htdig on a Linux box with the same configuration and
> the same proxy, and see if you get the same results. However, based on
> what you said about a portion of the sites not responding, I'd guess
> this is a more likely problem. I guess there could also be a problem
> with the proxy server itself, causing it to act like a server is down
> when it isn't. You may want to try different proxies as well. In any
> case, a close look at htdig -vvv output should give some clues.
>

I will try htdig under RedHat Linux if and when time permits.

> --
> Gilles R. Detillieux E-mail: <grdetil@scrc.umanitoba.ca>
> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
>

-- 
 
David J Adams
<D.J.Adams@soton.ac.uk>
Computing Services
University of Southampton



