[htdig] Suse 6.2 + htdig 3.1.5: looping again


Subject: [htdig] Suse 6.2 + htdig 3.1.5: looping again
From: Peter L. Peres (plp@actcom.co.il)
Date: Mon May 01 2000 - 14:50:17 PDT


Hi,

  I have run htdig with various options and have settled on a
compression of 3, -l, and a few other things. Performance and database
size are much better than before, but after a while it still tries to
re-index the whole site, i.e. it loops.
  The site is served by Apache and contains html, pdf, txt, and other
files, including directory indexes. Adding D=M? ... etc to the bad URLs
list seemed to help, yet the htdig program loops again.
  Is there no way to keep a list of seen URLs in htdig and avoid them?
That would be a first step towards avoiding looping. Surely there is a way
to do this. In theory, it should build a map of the site as a tree and
then walk it, refusing to 'climb' any branches. Yes, I did add .. to the
list of bad URLs to keep it from going up to the parents of directory
indexes. This seems to work.
  I make some simple assumptions here, but if only a hash table of visited
URLs were stored in memory, to avoid having to open the db file of URLs
every time, surely this would not slow things down too much?
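
  To make the hash table idea concrete, here is a minimal sketch in
Python (not htdig code). The names normalize, crawl and get_links are
hypothetical, and dropping every query string is an assumption that suits
Apache directory index sort links but would be too aggressive for a site
with real dynamic pages.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize(base, href):
    """Resolve href against base, then drop the query string and fragment."""
    parts = urlsplit(urljoin(base, href))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def crawl(start, get_links):
    """Visit every reachable URL once; get_links(url) returns raw hrefs."""
    seen = {start}            # in-memory hash table of URLs already queued
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for href in get_links(url):
            target = normalize(url, href)
            if target not in seen:   # skip anything queued before
                seen.add(target)
                queue.append(target)
    return order
```

  Because ".." links and sort-order links normalize to a URL that is
already in the table, they are never queued a second time, so the walk
ends once every distinct page has been seen.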
  To the best of my knowledge, Altavista used a 2-tiered engine, with a
gatherer (of URLs) and an indexer that follows the URL list built by the
gatherer.
  Is this, or will it ever be, a part of htdig?
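
  What I mean by two tiers can be sketched roughly as follows, in Python.
This is only an illustration of the split, not a description of how
Altavista or htdig actually work; gather, index, get_links and get_text
are hypothetical names.

```python
def gather(start, get_links):
    """Tier 1: collect each reachable URL exactly once, without indexing."""
    seen, todo = {start}, [start]
    for url in todo:                  # todo grows while we iterate over it
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                todo.append(link)
    return todo


def index(urls, get_text):
    """Tier 2: walk the finished URL list and map each word to its URLs."""
    words = {}
    for url in urls:
        for word in get_text(url).lower().split():
            words.setdefault(word, set()).add(url)
    return words
```

  Since the indexer only ever sees the finished list, it cannot loop,
whatever the links on the pages look like.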

thanks,

        Peter



