[htdig] htdig / Suse 6.2: very long run ?

Subject: [htdig] htdig / Suse 6.2: very long run ?
From: Peter L. Peres (plp@actcom.co.il)
Date: Tue Apr 25 2000 - 11:57:19 PDT


  I'm new to this list, and I have a question. I have a SuSE 6.2 box that
I am very happy with, except for htdig. I have run htdig as advised,
using the original setup, and it has indexed everything in the SuSE HTML
documentation, so I know it works.
  Then I added my own HTML and PDF docs to the site, and things stopped
working.
  Problem: htdig has been running for 24+ hours (i486/100MHz, 24MB RAM,
lots of disk space). The data to be indexed is not larger than 80MB.
  I have run the htindex command several times so far (interrupting it in
the middle, etc.). On the last run(s) I generated a URL and image list.
  I looked at this list with a command like:

grep <...db.urls http://myhost.here|sort -r|uniq -d|less

  and found lots of duplicates. I believe htdig has locked itself into a
loop of URLs, although I have almost no cross-linking (the SuSE HTML
pages do have plenty of it, however).
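To see which URLs are being revisited most, the duplicate counts can be inspected directly. A minimal sketch, assuming the dump is one URL per line (the `db.urls` path here is a stand-in for the actual htdig database file):

```shell
# Simulate an htdig URL dump with duplicates (in real use, this file
# comes from the htdig database; one URL per line).
printf '%s\n' http://myhost.here/a http://myhost.here/b \
              http://myhost.here/a http://myhost.here/a > db.urls

# Count how often each URL occurs and show the worst offenders first;
# heavily repeated URLs usually point at the looping pattern.
sort db.urls | uniq -c | sort -rn | head -20
```

The top few lines of the output typically reveal whether a small set of URLs is being fetched over and over.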

  So, what can I do ?

  Is there a way to do the initial dig using a list of URLs ? I am tempted
to make a giant URL list page from the URL list produced by htdig, after
running it through sort and uniq, and then let htdig index that with a
maximum depth of 1.
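The idea above could be sketched like this; the `db.urls` input path and `url_list.html` output name are assumptions, to be adjusted for the actual setup:

```shell
# Simulate htdig's URL dump (in real use this comes from the htdig
# database; one URL per line, possibly with many duplicates).
printf '%s\n' http://myhost.here/a http://myhost.here/b \
              http://myhost.here/a > db.urls

# Turn the de-duplicated list into a single page of links, which htdig
# can then be pointed at with a depth of 1.
{
  echo '<html><body>'
  sort -u db.urls | sed 's|.*|<a href="&">&</a>|'
  echo '</body></html>'
} > url_list.html
```

`sort -u` replaces the `sort -r | uniq` pair in one step; the resulting page contains each URL exactly once.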

  In theory, this should do what I want (index everything without locking
up). In practice, given the data sizes involved, I don't know, so I am
asking. Has anyone tried this ? Results ?

  The line (URL) count of the URL file obtained so far (with grep <...db.urls
my_site|sort -r|uniq|wc -l) is about 60,000. The file size is ~28MB.

  Have I grossly exceeded htdig's limits ? ;-)

  When is a built-in uniq URL feature scheduled ?



To unsubscribe from the htdig mailing list, send a message to
You will receive a message to confirm this.

This archive was generated by hypermail 2b28 : Tue Apr 25 2000 - 08:42:14 PDT