Re: [htdig] less files


Frank Guangxin Liu (frank@ctcqnx4.ctc.cummins.com)
Mon, 19 Jul 1999 08:44:53 -0500 (EST)


On Sun, 18 Jul 1999, Geoff Hutchison wrote:

>
>
> > since none of those existing files got changed (modified-since),
> > they won't be processed and thus those missing files
> > can't be seen by htdig.
>
> This is partly correct. If you have set remove_bad_urls, this is correct.
>
> From the documentation (http://www.htdig.org/attrs.html#remove_bad_urls):
>
> If TRUE, htmerge will remove any URLs which were marked as unreachable by
> htdig from the database. If FALSE, it will not do this. When htdig is run
> in initial mode, documents which were referred to but could not be
> accessed should probably be removed, and hence this option should then be
> set to TRUE, however, if htdig is run to update the database, this may
> cause documents on a server which is temporarily unavailable to be
> removed. This is probably NOT what was intended, so hence this option
> should be set to FALSE in that case.

If we can modify how "remove_bad_urls" works, we may not need
to switch it between TRUE and FALSE for initial and update digs.

As I understand it, if remove_bad_urls is currently set to TRUE,
any URL that htdig can't access gets deleted.
Can we change the criteria that define a "bad URL"?
"404 Not Found" should be at the top of the criteria list.
"Hostname not found" (maybe a typo, or the web server was
decommissioned and taken out of DNS...) should also be
a valid reason. Of course, we should distinguish
"hostname not found" from "resolution failed
because the DNS server is down". We shouldn't delete the URL
if we can't resolve the hostname for the latter reason.

If htdig can't access the URL for any other reason (one that
is not listed in the criteria), it should NOT mark it as a bad
URL and should NOT delete it. "Other reasons" include:
1) connection failed, due to
   a) server machine down (like an NT blue screen of death),
   b) network to the server down,
   c) httpd dead;
2) failed to retrieve the file, due to
   a) server busy?
   b) other unknown reasons.

Clearly those other reasons don't mean the URL is bad;
it just isn't accessible at the time htdig is running.
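
To make the proposal concrete, here is a rough sketch of the rule
in Python. It is not ht://Dig code: the names FetchOutcome,
PERMANENT_FAILURES and should_remove are made up for illustration,
and how a real fetcher would detect each outcome is left out.

```python
from enum import Enum


class FetchOutcome(Enum):
    """Hypothetical outcomes of one retrieval attempt (not ht://Dig names)."""
    OK = "ok"
    HTTP_NOT_FOUND = "404 Not Found"
    HOST_NOT_FOUND = "hostname not found"      # DNS answered: no such host
    DNS_UNAVAILABLE = "DNS server down"        # could not get an answer at all
    CONNECT_FAILED = "connection failed"       # machine, network or httpd down
    RETRIEVE_FAILED = "retrieval failed"       # server busy, unknown reasons


# Only outcomes that say "this URL no longer exists" count as bad.
PERMANENT_FAILURES = frozenset({
    FetchOutcome.HTTP_NOT_FOUND,
    FetchOutcome.HOST_NOT_FOUND,
})


def should_remove(outcome: FetchOutcome) -> bool:
    """Return True only when the failure is on the criteria list."""
    return outcome in PERMANENT_FAILURES
```

With a rule like this, a URL on a server that happens to be down
during an update dig stays in the database, while a 404 still gets
removed, so the same setting could serve both kinds of dig.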

Frank

>
> > should, instead of skipping this file (won't process
> > it at all), still parse the file for links. Of course,
>
> In general, the slowest part of the indexing is retrieving the document.
> So the update dig saves a *lot* of time by just sending out
> If-Modified-Since headers. So if an update dig "reparsed looking for
> URLs," it really wouldn't be any faster than the initial dig. In that
> case, why bother doing an update dig?
>
> -Geoff Hutchison
> Williams Students Online
> http://wso.williams.edu/
>
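
For readers unfamiliar with the If-Modified-Since mechanism quoted
above, here is a small Python sketch of such a conditional request.
It only illustrates the HTTP header, not how htdig builds its
requests; build_update_request and last_indexed are hypothetical
names, and last_indexed is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def build_update_request(url: str, last_indexed: datetime) -> Request:
    """Build a GET request that lets the server skip unchanged documents."""
    # HTTP dates are expressed in GMT, so convert before formatting.
    stamp = format_datetime(last_indexed.astimezone(timezone.utc), usegmt=True)
    return Request(url, headers={"If-Modified-Since": stamp})
```

A server that honors the header answers "304 Not Modified" with no
body when the document is unchanged, which is where the time saving
comes from: nothing is transferred, so there is nothing to reparse.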



