Subject: Re: [htdig] Htmerge: "Deleted, invalid"
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Tue Jul 25 2000 - 09:40:31 PDT


According to D.J.Adams@soton.ac.uk:
> How did I conclude that htdig is having no such problems?
> Two reasons:
> 1). At least one page on our main server, covered by my
> http_proxy_exclude statement, is "Deleted, invalid".

OK, so that would suggest the problem isn't limited to proxies.

> 2). When I do not use http_proxy then htdig -v gives clear
> messages, such as "Unable to connect to server" and
> "Server not responding".
> With http_proxy I get no such messages, not even with htdig -vvv
>
> Additionally:
> 3). I can access the pages using IE (same proxy) the same day,
> no problem.
> 4). One or two pages from a site may be affected while others
> are not.

Right, you did mention these two points much earlier. I had forgotten
about that.

> I have now re-run the index with htdig -i -vvv etc. I have rather a lot of
> information to go through, but I've found nothing yet.
>
> And that nothing is significant. What do you make of this, the log from htmerge
> includes:
>
> Deleted, invalid: 2200/http://www.folkmania.org.uk/LeeZachinfo.htm
>
> While the log from htdig includes this (slightly mangled by "more" command), which looks OK to me:
>
> pick: www.folkmania.org.uk, # servers = 246
> 1226:895:2:http://www.folkmania.org.uk/LeeZachinfo.htm: Retrieval command for http://www.folkmania.org.uk/LeeZachinfo.htm: GET http://www.folkmania.org.uk/LeeZachinfo.htm HTTP/1.0
> User-Agent: htdig/3.1.5 (D.J.Adams@soton.ac.uk)
> Referer: http://www.folkmania.org.uk/
> Host: www.folkmania.org.uk
>
> Header line: HTTP/1.0 200 OK
> Header line: Server: thttpd/2.07 02dec99
> Header line: Content-Type: text/html
> Header line: Date: Mon, 24 Jul 2000 03:35:01 GMT
> Header line: Last-Modified: Fri, 23 Jun 2000 18:34:50 GMT
> Translated Fri, 23 Jun 2000 18:34:50 GMT to 2000-06-23 18:34:50 (100)
> And converted to Fri, 23 Jun 2000 18:34:50
> Header line: Accept-Ranges: bytes
> Header line: Content-Length: 4586
> Header line: Age: 127170
> Header line: X-Cache: HIT from www-cacheb.soton.ac.uk
> Header line: X-Cache-Lookup: HIT from www-cacheb.soton.ac.uk:3128
> Header line: X-Cache: MISS from www-cachea.soton.ac.uk
> Header line: X-Cache-Lookup: MISS from www-cachea.soton.ac.uk:3128
> Header line: Proxy-Connection: close
> Header line:
> returnStatus = 0
> Read 4586 from document
> Read a total of 4586 bytes
>
> title: LeeZachInfo
> [snip]
> size = 4586

Hmm, you snipped just as it was getting interesting. I assume that there
were lots of entries for words being indexed, tags being parsed, and such?
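
As a rough aid for going through that much output, here is a small Python
sketch that cross-checks the two logs. It assumes the line shapes are
exactly as in the excerpts quoted above ("Deleted, invalid: <number>/<URL>"
from htmerge, "Retrieval command for <URL>: GET" from htdig -vvv); the
function names are made up for illustration, and the number before the
slash is only assumed to be a document ID.

```python
import re

# Line shapes assumed from the log excerpts quoted in this thread.
DELETED_RE = re.compile(r"^Deleted, invalid: (\d+)/(\S+)")
RETRIEVAL_RE = re.compile(r"Retrieval command for (\S+): GET")


def deleted_invalid(htmerge_log):
    """Map each 'Deleted, invalid' URL to the number printed before it."""
    found = {}
    for line in htmerge_log.splitlines():
        match = DELETED_RE.match(line.strip())
        if match:
            found[match.group(2)] = int(match.group(1))
    return found


def retrieved_urls(htdig_log):
    """Collect every URL for which htdig logged a retrieval command."""
    return set(RETRIEVAL_RE.findall(htdig_log))


def deleted_but_retrieved(htmerge_log, htdig_log):
    """URLs that htdig fetched but htmerge later reported as deleted."""
    deleted = set(deleted_invalid(htmerge_log))
    return sorted(deleted & retrieved_urls(htdig_log))
```

Running deleted_but_retrieved() once against each htdig log would also show
whether the affected URLs all come from one of the two runs.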

> I can add another theory:
>
> It is a bug when merging a second index
> - all the "Deleted, invalid" pages come from the htdig run specified
> with the htmerge -m option
>
> This theory is easy to check out, I'll investigate tomorrow.

OK, this brings a question to mind. Did you run htmerge separately
on each of the two databases created by the htdig runs, before running
htmerge to merge the two databases together? I think that, as a minimum,
you must run htmerge after htdig to clean up the database before using
it as the -m option for a merge. You may have to clean up the target
database too - I'm not completely certain about that, but I know it
can't hurt.
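
To make the order of operations concrete, here is a minimal Python sketch
that only prints the sequence of commands described above; it does not run
anything. The config file names main.conf and extra.conf are hypothetical,
and the sketch assumes each database has its own config file passed with -c.
Whether the cleanup pass on the target database is strictly needed is, as
noted, uncertain; it is included because it should be harmless.

```python
def merge_plan(target_conf, extra_conf):
    """Return the command lines, in order, for a clean two-database merge.

    target_conf and extra_conf are hypothetical config file paths, one per
    database. Nothing is executed here; the caller decides what to do.
    """
    return [
        ["htdig", "-i", "-c", target_conf],    # build the target database
        ["htmerge", "-c", target_conf],        # clean up the target database
        ["htdig", "-i", "-c", extra_conf],     # build the second database
        ["htmerge", "-c", extra_conf],         # clean it up before merging
        ["htmerge", "-c", target_conf, "-m", extra_conf],  # merge it in
    ]


if __name__ == "__main__":
    for command in merge_plan("main.conf", "extra.conf"):
        print(" ".join(command))
```

The key point is that the database named with -m has already been through
its own htmerge pass before the final merge step.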

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
