Re: [htdig] less files


Frank Guangxin Liu (frank@ctcqnx4.ctc.cummins.com)
Sun, 18 Jul 1999 21:17:45 -0500 (EST)


On Sun, 18 Jul 1999, Geoff Hutchison wrote:

> > since none of those existing files got changed (modified-since),
> > they won't be processed and thus those missing files
> > can't be seen by htdig.
>
> This is partly correct. If you have set remove_bad_urls, this is correct.
>
> From the documentation (http://www.htdig.org/attrs.html#remove_bad_urls):
>
> If TRUE, htmerge will remove any URLs which were marked as unreachable by
> htdig from the database. If FALSE, it will not do this. When htdig is run
> in initial mode, documents which were referred to but could not be
> accessed should probably be removed, and hence this option should then be
> set to TRUE, however, if htdig is run to update the database, this may
> cause documents on a server which is temporarily unavailable to be
> removed. This is probably NOT what was intended, so hence this option
> should be set to FALSE in that case.
>
> > should, instead of skipping this file (won't process
> > it at all), still parse the file for links. Of course,
>
> In general, the slowest part of the indexing is retrieving the document.
> So the update dig saves a *lot* of time by just sending out
> If-Modified-Since headers. So if an update dig "reparsed looking for
> URLs," it really wouldn't be any faster than the initial dig. In that
> case, why bother doing an update dig?
My mistake. I thought the contents of the files were already
saved in the db. "Reparse looking for URLs" shouldn't require
re-retrieving the file, since it has not been modified.
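
To illustrate the point about If-Modified-Since, here is a rough sketch
in Python (not htdig's own code; the function names are hypothetical and
only meant for illustration). It builds a conditional request and shows
why a 304 reply leaves nothing to reparse: the server sends no body, so
links could only be re-extracted from a locally stored copy.

```python
from email.utils import formatdate
from urllib.request import Request


def conditional_request(url, last_fetch_epoch):
    """Build a GET request that asks for the document only if it changed.

    last_fetch_epoch is the time of the previous dig, in seconds since
    the Unix epoch (an assumption of this sketch).
    """
    # HTTP dates are expressed in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    stamp = formatdate(last_fetch_epoch, usegmt=True)
    return Request(url, headers={"If-Modified-Since": stamp})


def has_body_to_parse(status):
    """Return True if the response carries a document body.

    A 304 (Not Modified) reply has no body, so there are no links to
    extract unless the earlier contents were kept locally.
    """
    return status != 304
```

So an update dig that wanted to follow links in unmodified files would
need either a stored copy of each document or a full retrieval.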

>
> -Geoff Hutchison
> Williams Students Online
> http://wso.williams.edu/
>



