Subject: [htdig] htdig / Suse 6.2: very long run ?
From: Peter L. Peres (firstname.lastname@example.org)
Date: Tue Apr 25 2000 - 11:57:19 PDT
I'm new to this list, and I have a question. I have a SuSE 6.2 box that
I am very happy with, except for htdig. I ran htdig as advised, using
the original setup, and it indexed everything in the SuSE HTML system,
so I know it works.
Then, I added my own HTML and PDF docs to the site, and things stopped
working.
Problem: htdig has been running for 24+ hours (i486/100MHz, 24MB RAM,
lots of disk space). The data to be indexed is not larger than 80MB.
I have run the htindex command several times so far (interrupting it in
the middle, etc.). The last few times, I generated a URL and image list.
I examined this list using a command like:
grep <...db.urls http://myhost.here|sort -r|uniq -d|less
and found lots of duplicates. I believe that htdig has locked itself
in a loop of URLs, although I have almost no cross-indexing (the SuSE HTML
pages do have plenty of that, however).
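
For anyone who wants the duplicate counts rather than just the duplicated lines, the check above can be sketched in Python. This is only an illustration under assumptions: the function name `duplicate_urls` and the sample URLs are made up, and the list is assumed to hold one URL per line, as the grep/sort/uniq pipeline implies.

```python
from collections import Counter

def duplicate_urls(lines, needle):
    """Return {url: count} for lines containing `needle` that occur more than once.

    Roughly mirrors `grep needle | sort | uniq -d`, but also keeps the counts.
    """
    counts = Counter(
        line.strip() for line in lines if needle in line
    )
    return {url: n for url, n in counts.items() if n > 1}

# Hypothetical sample data standing in for the URL list file.
sample = [
    "http://myhost.here/a.html",
    "http://myhost.here/b.html",
    "http://myhost.here/a.html",
    "http://other.example/c.html",
]
dupes = duplicate_urls(sample, "http://myhost.here")
```

With a real list, the lines could be read by passing an open file object instead of `sample`. URLs with very high counts would be the first candidates to inspect for a loop.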
So, what can I do ?
Is there a way to do the initial dig using a list of URLs ? I am tempted
to make a giant URL list page using the URL list produced by htdig, after
running it through uniq, and then let htdig index that, with a depth of 1.
In theory, this should do what I want (index everything and not lock
up). In practice, given the data sizes involved, I don't know. Therefore I
ask. Has anyone tried this ? Results ?
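
The "giant URL list page" idea could be sketched as below. This is a rough sketch, not a tested recipe for htdig: the function name `build_link_page` and the sample URLs are hypothetical, and whether htdig copes with a single page holding tens of thousands of links is exactly the open question.

```python
import html

def build_link_page(urls, title="URL list"):
    """Build one HTML page with a link to each distinct URL, in first-seen order."""
    # dict.fromkeys drops duplicates while preserving order.
    unique = list(dict.fromkeys(u.strip() for u in urls if u.strip()))
    links = "\n".join(
        '<a href="{0}">{0}</a><br>'.format(html.escape(u, quote=True))
        for u in unique
    )
    return (
        "<html><head><title>{}</title></head>\n<body>\n{}\n</body></html>\n"
    ).format(html.escape(title), links)

# Hypothetical sample data; a real run would read the de-duplicated URL list.
page = build_link_page([
    "http://myhost.here/a.html?x=1&y=2",
    "http://myhost.here/b.pdf",
    "http://myhost.here/a.html?x=1&y=2",
])
```

The resulting string would be written to a file on the web server and used as the single starting page for the dig. If one page turns out to be too large, the same function could be called on slices of the list to produce several smaller pages.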
The line (URL) count of the URL file obtained so far (with grep <...db.urls
my_site|sort -r|uniq|wc -l) is about 60,000. The file size is ~28MB.
Have I grossly exceeded htdig's limits ? ;-)
When is a built-in uniq URL feature scheduled ?