Subject: Re: [htdig] htdig / Suse 6.2: very long run ?
From: Peter L. Peres (plp@actcom.co.il)
Date: Tue Apr 25 2000 - 14:47:18 PDT


On Tue, 25 Apr 2000, Geoff Hutchison wrote:

>At 9:57 PM +0300 4/25/00, Peter L. Peres wrote:
>> Then, I added my own HTML and PDF docs to the site, and things stopped
>>being ok.
>> Problem: htdig has been running for 24+ hours (i486/100MHz, 24MB RAM,
>>lots of disk space). The data to be indexed is not larger than 80MB.
>> I have run the htindex command several times so far (interrupted in the
>>middle etc). The last time(s) I generate a URL and image list.
>
>OK, my first suggestion is to disable any cron-jobs or anything that
>might write to the databases. Then I'd delete all your old ones and
>reindex from scratch. I would guess that your databases are pretty
>much dead right now. When you reindex, I'd probably use -v so you can
>see the URLs it's indexing as it goes. This way you can see if it
>gets into a loop.

I am using -v and -i, and I have set the config to generate URL and image
lists. The machine does nothing in particular besides indexing; it is a
headless box connected only by Ethernet to my home network.

The first indexing (of the Suse system) lasted about 2 hours, if I remember
correctly. Time is not a problem unless we are talking about weeks. I'd just
like to know whether the ETA is computable ? ;-)

>But right now if you stop it in the middle of a run, there's no
>guarantee your databases are worth much. You might want to use the -l
>flag, which will trap interruptions and attempt to exit gracefully.
>Of course this also means exits will take some time.

I restart from scratch with -i every time. I'm using a second set of
databases for this run. The original set is preserved, so I can still use
htsearch while I am working (infrequently).
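
In case it matters, the second set is just a separate config with its own
database_dir, run via -c (the directory and file names are placeholders):

    # htdig-new.conf: identical to the live config except for this line
    database_dir:   /opt/www/htdig/db.new

    # and the run:
    htdig -i -v -c /opt/www/conf/htdig-new.conf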

I unceremoniously killed htdig once before, then ran htmerge on the result
with no problems, and htsearch worked. It seems to be robust enough, IMHO ;-).
At least for offline machines.
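
The recovery after that kill was no more than this (config path is again a
placeholder):

    htmerge -v -c /opt/www/conf/htdig.conf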

>> Is there a way to do the initial dig using a list of URLs ? I am tempted
>>to make a giant URL list page using the URL list produced by htdig, after
>>running it through uniq, and then let htdig index that, with a depth of 1.
>
>Yes, you can include a file into any attribute easily, e.g.:
>start_url: `/path/to/file`
>
>A few notes:
>1) You'll want a hopcount of 0. If you index to a depth of 1, this
>will include all links from these pages.
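
Concretely, the plan I had in mind would look something like this (untested;
the sort|uniq step and the file names are just how I would go about it):

    # build a flat URL list from the one htdig wrote out
    sort /opt/www/htdig/db.urls | uniq > /opt/www/htdig/start.urls

    # then, in the config for the next run:
    start_url:      `/opt/www/htdig/start.urls`
    max_hop_count:  0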

I am talking about many links that are actually directories (with text,
source code and such), which are not indexed in any particular way. Will
htdig cope with that kind of information ?

>2) The indexer keeps a very good list of URLs--it will never reindex
>a page with the same URL. (Note the last part of that--you might have
>"the same page" with different URLs.)

Not likely, as I have set the respective equivalent names in the config
file.
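
(For the record, what I set is server_aliases; the hostnames below are made
up and I am quoting the syntax from memory:)

    server_aliases: www.example.home:80=example.home:80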

>3) If you index this way, you will use more memory since it will
>have to assemble the URL list at once. Normally, it can add a few
>links at a time to the list, so it has less overhead.

I suspect that the list will be mostly swapped out (I hope they don't mmap
the file ?!) during the run, however large the file might be. That would
cause 2 page swaps per document read (worst case: 1 page out, 1 page in).
Not something to worry about, IMHO.

>> Have I grossly exceeded htdig's limits ? ;-)
>
>Hardly. Many people regularly index many times that.

OK, are there any run-time figures for this ?

thanks a lot,

        Peter

PS: I might have missed a few mails from the list due to the spam filter. If
there was any mail on this $SUBJ between my (plp) mail and yours (Geoff H.),
please say so. From now on, list messages are getting through.

------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig-unsubscribe@htdig.org
You will receive a message to confirm this.
