Re: htdig: Using htdig just a few hours per day

Gianugo Rabellino
Wed, 2 Dec 1998 19:03:15 +0100

On Tue, Dec 01, 1998 at 09:19:39PM -0500, Geoff Hutchison wrote:

> >per site), so I understand that size shouldn't be a big
> >issue. Should the number of sites be a problem, I might
> >reduce it to the most important 50-100 of them.
> Not really a problem. The problems with size have more to do with the
> number of pages and words and the amount of each page stored.

Well, I did a test and successfully indexed 90 sites. With a database
of more than 100 Megs, searches on my overloaded P100/32MB with
full X11 & Netscape & lots of stuff running (no, this won't be the
final server :)) take at most a couple of seconds, so I'm quite
impressed and happy with the results :)

What scared me was the disk space required by the sorting process:
I had to repartition because 700 MB wasn't enough! It used some
900 MB, quite a high figure (well... luckily iron is cheap :). Do
you have a formula for estimating the space the sorting process
needs? As I understand it, the figures cited in the FAQ refer only
to the final database...
> >Is there a way to do that? I understand that the "-a"
> >option might be helpful, since it keeps a copy of the
> >existing database, but I don't see how I can tell
> >htdig to "resume" a suspended run (and even how to
> >suspend it: I don't know if htdig would behave if I
> >send him SIGSTOPs and SIGCONTs via cron).
> Well you can always try it. To save the most time on htdig, make sure you
> don't use "-i" and that there's an old copy of the database around (e.g.
> with "-a" make sure there are .work files from previous runs).

I did a small test: SIGSTOPping the process and SIGCONTinuing
it a couple of hours later did the job. What really scares me,
though, is the chance of a forced reboot: with roughly two weeks
needed to complete the indexing run, that's not such a remote
possibility... and Murphy says the reboot would come after
14 days and then some :/
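
For reference, the cron-driven pause/resume described above could be sketched roughly as follows. This is only an illustration in Python: the helper name `control`, the `ACTIONS` table, and the idea of looking up the indexer's PID yourself (e.g. from a PID file you write when starting it) are assumptions, not anything htdig provides.

```python
import os
import signal

# Hypothetical mapping from an action name to the job-control signal to send.
ACTIONS = {"pause": signal.SIGSTOP, "resume": signal.SIGCONT}

def control(pid: int, action: str) -> None:
    """Send SIGSTOP ("pause") or SIGCONT ("resume") to the given process.

    The PID is assumed to be that of the running indexer; how it is
    obtained (PID file, process listing, ...) is left to the caller.
    """
    os.kill(pid, ACTIONS[action])
```

Two cron entries could then call `control(pid, "pause")` in the morning and `control(pid, "resume")` in the evening. A stopped process keeps its memory and open files, but nothing survives a reboot, which is exactly the weakness mentioned above.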

Would it be very difficult to code htdig so that, on receiving
a signal, it dies gracefully and dumps its state in a way that
makes it easy to resume from where it stopped? Don't you
think this might be useful to others as well, not just to me?
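
To make the idea concrete, here is a minimal sketch of "stop on a signal, dump state, resume later", written in Python rather than htdig's own code. Everything in it is hypothetical: the function names, the JSON state file, and the simple list of pending URLs are assumptions chosen for illustration and say nothing about how htdig actually tracks its progress.

```python
import json
import os
import signal

stop_requested = False

def request_stop(signum, frame):
    """Signal handler: only set a flag; the main loop does the real work."""
    global stop_requested
    stop_requested = True

def install_handler():
    """Ask for a graceful stop on SIGTERM (must be called from the main thread)."""
    signal.signal(signal.SIGTERM, request_stop)

def save_state(path, pending, done):
    """Write the state to a temp file, then rename it into place atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"pending": pending, "done": done}, f)
    os.replace(tmp, path)

def load_state(path, start_urls):
    """Resume from a saved state if one exists, else start from scratch."""
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
        return state["pending"], state["done"]
    return list(start_urls), []

def run(path, start_urls, process):
    """Process URLs until finished or asked to stop; return True if finished."""
    global stop_requested
    stop_requested = False
    pending, done = load_state(path, start_urls)
    while pending and not stop_requested:
        url = pending.pop(0)
        process(url)  # hypothetical per-URL indexing step
        done.append(url)
    save_state(path, pending, done)
    return not pending
```

A driver would call `install_handler()` once and then `run(...)`; running it again with the same state file picks up where the previous run left off. Saving the state periodically inside the loop as well would also cover a forced reboot, where no signal is ever delivered.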
> This will also get easier when I actually sit down and write the code to
> merge multiple databases.

That would certainly be nice! By merging, do you mean that the
merge will produce a single, huge database, or that queries will
be run against several smaller databases?

Gianugo Rabellino  

OperaWeb - All you wanted to know from the Opera world!

This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:29:44 PST