Re: [htdig] htdig update is checking ALL pages already in a DB

denis filipetti
Fri, 05 Feb 1999 18:39:20 -0500

At 06:51 PM 2/4/99 -0400, Geoff Hutchison wrote:
>At 4:39 PM -0400 2/4/99, denis filipetti wrote:
>>limit_urls_to. Is that correct ? We will need to update any given page at
>Yes this is correct. It will read in all the URLs in the old database and
>use those as pages to check.
>>certain times for our users, in a DB that would be time consuming and
>>unnecessary to totally reindex. Is there any way that I can do that ?
>I'm not so sure it's "time consuming," but I guess it depends on how
>frequent you're talking about updating and how many URLs you have. Update
>digs for me, on 75,000 URLs take a total of about 35 min (including

Oops, the "time consuming" part is in our product, which generates each and
every page. We are actually quite happy with ht://Dig's functionality and
speed.

This particular update process runs after a user updates a page (and while
s/he waits), which is the root of our concern.

>You can always try digging the certain pages with a separate config file
>and using the new merge feature (in the snapshots or the
>hopefully-soon-to-be-released 3.1.0) to merge that database into the main

I take it that the merge detects duplicate URLs and discards the one that
was dug the longest ago? Perhaps a dig-one-page/merge-to-full-DB approach
would be the way for us to go?
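
The merge behavior asked about above can be sketched as follows. This is a
hypothetical model in Python, not code from ht://Dig: the PageRecord layout
and the merge_by_url name are invented for illustration, and whether the
real merge keeps the more recently dug copy is exactly the open question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRecord:
    """Hypothetical stand-in for one indexed document (not ht://Dig's format)."""
    url: str
    dug_at: int  # time of the dig, e.g. seconds since the epoch
    body: str


def merge_by_url(main_db, small_db):
    """Merge two record lists, keeping the most recently dug copy per URL."""
    merged = {}
    for record in list(main_db) + list(small_db):
        current = merged.get(record.url)
        # Keep this record if the URL is unseen or this copy is newer.
        if current is None or record.dug_at > current.dug_at:
            merged[record.url] = record
    # Sort by URL so the result does not depend on input order.
    return sorted(merged.values(), key=lambda r: r.url)
```

Under this model, digging one page into a small database and merging it
into the full one replaces only that page's record and leaves every other
record untouched.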

Before you mentioned the new merge feature I was wondering if using the -a
option of htdig might be a solution. I now suspect that htdig's actions are
basically the same, except that the existing DB is first copied and the copy
(.work) is then treated as if it were the DB named in the .conf file. Have I
got that right?

> I can't promise anything as far as speed because it still has to check
>the databases...

As long as it doesn't hit the web server for anything but the single page
I'm willing to bet the performance would be fine.

>>"not in the limits" but at other times "GET"ing that same URL (in the same
>>run) ! I suspect this dove-tails nicely with the previous question !
>There are some bugs in the string matching code in 3.1.0b4 and previous
>versions. As far as we know, all of them have been fixed in the current
>development source.

Hmm, OK, let's see if I have this straight: limit_urls_to is only used in
reference to start_urls, and it is *not* used to control which pages are
rechecked when one doesn't use -i. Yes?
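
The reading proposed in that question can be written down as a small
Python sketch. It models only the hypothesis as stated and is not taken
from the ht://Dig source; the pages_to_check function, its parameters, and
the treatment of limits as plain substring patterns are all assumptions
made for illustration.

```python
def pages_to_check(start_urls, limit_urls_to, old_db_urls, initial=False):
    """List the URLs an update dig would visit under the reading above.

    Assumption: limits are substring patterns applied to start_urls only;
    URLs already in the old database are rechecked regardless of limits.
    """
    def within_limits(url):
        return any(pattern in url for pattern in limit_urls_to)

    queue = [u for u in start_urls if within_limits(u)]
    if not initial:  # without -i, the old database seeds the run as well
        for url in old_db_urls:
            if url not in queue:
                queue.append(url)
    return queue
```

If the reading is right, a URL outside the limits but present in the old
database would still be fetched on an update run, and only an initial
dig (-i) would restrict the run to the limited start URLs.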

>-Geoff Hutchison
>Williams Students Online

Many thanks for the help,


This archive was generated by hypermail 2.0b3 on Wed Feb 10 1999 - 17:09:05 PST