Re: [htdig] update digging


Frank Guangxin Liu (frank@ctcqnx4.ctc.cummins.com)
Tue, 2 Mar 1999 13:37:26 -0500 (EST)


>
> On Tue, 2 Mar 1999, Frank Guangxin Liu wrote:
>
> > Is it true that if I run htdig to update my db (without -i option),
> > htdig will ignore "limit_urls_to" and try all the urls in the db?
>
> Yes, I thought I already answered this question from you.

Something must be wrong, because I never received a reply to this post.
I checked the February mailing list archive at www.htdig.org and couldn't
find one there either. You did answer a similar question from
denis filipetti in the thread "htdig update is checking ALL pages
already in a DB".

> I don't know that I consider this a bug. It "updates" the db by checking

That's fine. I just want to make sure. I can run another htdig and use
the new htmerge feature.

> all the URLs in the database. Basically, it generates a list of all the
> URLs in the DB and then checks them for changes.
>
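
To make the difference concrete, here is a minimal sketch of the behaviour
described above. It is not ht://Dig's actual code: the function names are
hypothetical, and treating "limit_urls_to" as a plain substring match is an
assumption made for illustration.

```python
def within_limits(url, limit_urls_to):
    """Return True if the URL matches one of the limit patterns."""
    # Assumption: each pattern is matched as a plain substring of the URL.
    return any(pattern in url for pattern in limit_urls_to)


def urls_to_check(start_urls, stored_urls, limit_urls_to, initial):
    """Return the list of URLs a dig would begin with (hypothetical model)."""
    if initial:
        # Initial dig (-i): begin at the start URLs, filtered by the limits.
        return [u for u in start_urls if within_limits(u, limit_urls_to)]
    # Update dig: every URL already in the database is rechecked,
    # without consulting limit_urls_to.
    return sorted(stored_urls)
```

Under this model, a URL that entered the database during an earlier run is
rechecked on every update, even if it no longer matches the current limits.
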
> > only several html files and found GET for all files although
> > there has been no change between the initial htdig and the update
> > htdig for this small www server).
>
> On an update dig, ht://Dig sends an If-Modified-Since header. It does this

Hmm. Maybe my Apache 1.2.x server doesn't support the If-Modified-Since header?

> as a GET because it wants to make one connection. The HTTP specification
> says that this header will return the document if it's modified and an
> error code (I forget off the top of my head) if it's not modified.
>
> I've noticed this header does not always work. However, in version 3.1.0
> and on, after receiving the data, htdig checks the date in the header
> before parsing it. So even if the server incorrectly sends the data,

That is much better, though a lot of network bandwidth is still wasted.
Would it be safer to use two connections for each document during an
update dig, a HEAD followed by a GET? Does the reply to a HEAD provide
more reliable information and always give the last modification date?
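
For reference, the status a server returns to an If-Modified-Since request
when the document is unchanged is 304 Not Modified, with no body. The sketch
below shows the single-connection check in rough form: send the header, then
fall back to comparing Last-Modified if the server sends the document anyway.
It is only an illustration; the function names are hypothetical and this is
not ht://Dig's actual implementation.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_headers(last_indexed):
    """Build the request header for a conditional GET."""
    # HTTP dates are expressed in GMT; last_indexed must be a UTC datetime.
    return {"If-Modified-Since": format_datetime(last_indexed, usegmt=True)}


def needs_reparse(status, response_headers, last_indexed):
    """Decide whether a fetched document has to be parsed again."""
    if status == 304:
        # The server honoured If-Modified-Since and sent no body.
        return False
    last_modified = response_headers.get("Last-Modified")
    if last_modified is None:
        # No date to compare against, so reparse to be safe.
        return True
    try:
        modified = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        # Unparseable date: reparse to be safe.
        return True
    if modified.tzinfo is None:
        # Treat a date without a zone as GMT.
        modified = modified.replace(tzinfo=timezone.utc)
    # Full body received: skip parsing unless it is newer than our copy.
    return modified > last_indexed
```

A HEAD request returns the same headers as the corresponding GET, so the
Last-Modified value is no more reliable there; it only avoids transferring
the body, at the cost of a second request for every changed document.
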

> ht://Dig won't bother continuing.
>
> > Another strange thing is that although I deleted some html files
> > on the server http://www3.mydept.mycompany.com BEFORE the update
> > run of htdig, those deleted URLs were still left in the db. A subsequent
>
> You have an old version of ht://Dig. This was a bug fixed in version
> 3.1.0.

Hmm. I am running the latest version.
Both the initial db and the update db were created using htdig-3.1.1.
I had to re-create the initial db because the pdf_parser screwed up
in htdig-3.1.0.

>
> -Geoff Hutchison
> Williams Students Online
> http://wso.williams.edu/
>
>



