Re: [htdig] htdig / Suse 6.2: very long run ?


From: Geoff Hutchison (ghutchis@wso.williams.edu)
Date: Tue Apr 25 2000 - 14:50:52 PDT


At 12:47 AM +0300 4/26/00, Peter L. Peres wrote:
>I am using -v, -i and I have set the config to generate url and image
>lists. The machine does nothing in particular besides indexing, it is a
>headless one with ethernet connection only on my home network.

I was referring more to having other copies of htdig running.
Sometimes people have htdig running from a cron job and if a run
takes too long, cron spawns another copy and they "collide." Not
pretty. :-(
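One way to keep overlapping runs from colliding is to have the cron job take a lock before starting. The following is only a rough sketch, not something htdig provides: the function name run_exclusive and the lock path are made up for illustration, and it assumes a Unix system where Python's fcntl module is available.

```python
import fcntl
import subprocess


def run_exclusive(cmd, lock_path):
    """Run cmd only if nobody else currently holds a lock on lock_path.

    Returns the command's exit status, or None if the lock was busy.
    The name and behaviour are illustrative, not part of htdig.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails at once if another run holds it.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        # The lock is released when lock_file is closed at the end of the block.
        return subprocess.run(cmd).returncode
```

A cron wrapper could then call something like run_exclusive(["htdig", "-v"], "/tmp/htdig.lock") (the lock path is a hypothetical choice) and simply exit when it gets None back, so a slow run is never joined by a second copy.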

>The first indexing (of the Suse system) lasted about 2 hours if I am not
>wrong. Time is not a problem unless we are talking about weeks. I'd like
>to know if the ETA is computable ? ;-)

The ETA depends a lot on how many pages have been added since the
previous run. From your comments, I would guess you're right and
there's some sort of loop going on. See below.

>I have unceremoniously killed htdig once before, then ran htmerge on the
>result with no problems, and htsearch worked. It seems to be robust enough
>imho ;-). At least for offline machines.

Uh. Let's just say I don't recommend it unless you're also using -l.
It might work, it might not. I'll leave it at that. :-)

>I am talking about many links that are actually directories (with text and
>source code and such), that are not indexed in any particular way. Will
>htdig cope with that kind of information ?

Sure. I'm assuming these are coming up as some sort of Apache
directory index or the like. The worry is if you get into some sort
of infinite URL loop. For example, you might have something like this:

http://www.foo.com/bar -> http://www.foo.com/bar/bar -> /bar/bar/bar ...

Usually this only happens when you have some sort of dynamic or
server-parsed pages and bad typos in your links. This is why I'd look
through the URLs coming up when you run with -v. It's often very easy
to spot a URL loop yourself.
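If the list of URLs is too long to scan by eye, a small script can flag the suspicious ones. This is a minimal sketch under stated assumptions: it takes plain URL strings (it does not parse htdig's -v output, whose exact format is not assumed here), the function name looks_like_loop is invented, and it only catches the simple case of one path segment repeated back to back, as in the /bar/bar/bar example.

```python
from urllib.parse import urlsplit


def looks_like_loop(url, min_repeats=3):
    """Return True if one path segment repeats min_repeats times in a row."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    run = 1
    for prev, cur in zip(segments, segments[1:]):
        # Extend the current run of identical segments, or start a new one.
        run = run + 1 if cur == prev else 1
        if run >= min_repeats:
            return True
    return False
```

The threshold of three repeats is an arbitrary default; a legitimate site can have /bar/bar, so a lower value may give false alarms. Longer repeating patterns such as /a/b/a/b are not detected by this check.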

>cause 2 file swaps per document read (worst case: 1 page out, 1 page in)
>imho. Not something to worry about imho.

This isn't quite how it works. But I'll say that personally I try to
avoid swapping as much as possible. At least if I'm in the same room
as the server. :-) OK, I admit, I've heard some noisy drives in my
day.

> >Hardly. Many people regularly index many times that.
>
>Ok, are there some run time figures for this ?

Well, the machine at wso.williams.edu was indexing upwards of 90,000
URLs last time I checked. Initial indexing could take around 8-9
hours. In this case the machine is indexing another webserver and
speed is often limited by network bandwidth.

The machine hosting htdig.org did an update of the index for 15,000+
URLs via local_urls in about 35 minutes this morning.
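As for whether an ETA is computable: only very roughly, by turning figures like these into a rate. The sketch below does that arithmetic with the numbers quoted above; the helper names are made up, and the rates apply to those two machines only, so treat any estimate for another setup as a guess.

```python
def urls_per_second(urls, minutes):
    """Average indexing rate for a run of the given size and duration."""
    return urls / (minutes * 60)


def eta_hours(urls, rate):
    """Hours needed to index `urls` documents at `rate` URLs per second."""
    return urls / rate / 3600


# Figures quoted above: 90,000 URLs in 8-9 hours over the network,
# and 15,000 URLs in about 35 minutes via local_urls.
remote_slow = urls_per_second(90_000, 9 * 60)
remote_fast = urls_per_second(90_000, 8 * 60)
local_update = urls_per_second(15_000, 35)
```

That works out to about 3 URLs per second for the networked initial dig and about 7 per second for the local update. The two are not directly comparable, since one is an initial index and the other an update run.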

>PS: I might have missed a few mails from the list due to the spam filter.
>If there was any mail on this $SUBJ between my (plp) and your (Geoff H.)'s
>emails, then please say so. From now on, list messages make it.

Nothing posted to the list at least.

Cheers,

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/



