htdig: Using htdig just a few hours per day

Gianugo Rabellino
Tue, 1 Dec 1998 16:14:21 +0100


I just started using htdig on our site and I'm really
impressed with the results (say bye-bye to Excite for
Web Servers :)). I will definitely use it to index
our site.

I have another (much bigger) project, though, and I need
your opinion on whether it's feasible and, if so,
what the best way to implement it would be. While I don't
expect to use htdig as another "Altavista", I would love
to use it to index the opera resources on the Internet. I'm
talking about maybe 200-300 sites worldwide, most of them
with just a few pages (the average might be 7-10 pages
per site), so I expect that size shouldn't be a big
issue. Should the number of sites be a problem, I could
reduce it to the most important 50-100 of them.

The biggest problem, however, would be the "digging" part:
here in Italy bandwidth costs a *lot* and every single
bit is precious, so it would be really great if I could
index these sites at night, when bandwidth usage is
much lower. I did a few tests, though, and found that I need
about 20 hours to gather all the material. I therefore need
to be able to do a "partial" dig (say 2 hours per night,
or even less), so that the database gets updated every
10-15 days without hurting the bandwidth too much.
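To put rough numbers on that schedule, here is a small Python sketch of the arithmetic. The function name is only illustrative, and it assumes the roughly 20 hours I measured can be split evenly across nights, which I have not verified.

```python
import math


def nights_needed(total_hours: float, hours_per_night: float) -> int:
    """Number of nightly runs needed to cover one full crawl."""
    if hours_per_night <= 0:
        raise ValueError("hours_per_night must be positive")
    return math.ceil(total_hours / hours_per_night)


if __name__ == "__main__":
    print(nights_needed(20, 2))    # 10 nights at 2 hours per night
    print(nights_needed(20, 1.5))  # 14 nights at 1.5 hours per night
```

So a window of 1.5 to 2 hours per night lands in the 10-15 day range mentioned above.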

Is there a way to do that? I understand that the "-a"
option might be helpful, since it keeps a copy of the
existing database, but I don't see how I can tell
htdig to "resume" a suspended run (or even how to
suspend it: I don't know whether htdig would behave if I
sent it SIGSTOPs and SIGCONTs via cron).
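To be concrete about the suspend/resume idea, this is roughly what I have in mind, sketched in Python. The helper names are hypothetical, and the child process is only a sleeping stand-in for a long htdig run; whether htdig itself (and its open network connections) survives being stopped like this is exactly what I don't know.

```python
import os
import signal
import subprocess
import sys


def suspend(pid: int) -> None:
    """Pause a process. SIGSTOP cannot be caught or ignored."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid: int) -> None:
    """Let a previously stopped process continue."""
    os.kill(pid, signal.SIGCONT)


def demo() -> bool:
    # Stand-in for the indexer: a child process that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        suspend(child.pid)
        alive_while_stopped = child.poll() is None
        resume(child.pid)
        alive_after_resume = child.poll() is None
        return alive_while_stopped and alive_after_resume
    finally:
        # SIGKILL is delivered even if the child is still stopped.
        child.kill()
        child.wait()


if __name__ == "__main__":
    print("child survived stop/continue:", demo())
```

In practice the two calls would be issued by two separate cron jobs, one in the morning and one at night, each needing some way to find the PID of the running indexer.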

I'd greatly appreciate even pointers to documentation,
FAQs and such. Also, if you think there is better
software for this task, suggestions would
be most welcome :)

Thanks in advance for your help,

Gianugo Rabellino  

OperaWeb - All you wanted to know from the Opera world!
