RE: [htdig] indexing

Subject: RE: [htdig] indexing
From: David Schwartz (
Date: Fri Jan 07 2000 - 11:43:50 PST

> At 11:01 AM -0800 1/7/00, David Schwartz wrote:
> > The 'htdig' process consumes more and more memory as it runs. This
> > might be due to memory leaks, or it might be legitimately due to it
> > keeping track of all the URLs it has to process. I tried htdigging
> > 250,000 documents and hit about 180 MB.

> At this point (3.1.4), there do not seem to be memory leaks
> left--obviously if someone finds any with Purify, we'd fix them.

        That's kind of puzzling, then. I ran htdig 3.1.4 (under Linux) on about
250,000 documents, and its memory use grew gradually to about 180 MB. I
don't know how htdig stores URLs internally, but the average URL length was
about 80 characters, and 80 x 250,000 is only about 20 MB. That leaves
roughly 160 MB that the raw URL text does not account for.
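To make the arithmetic explicit, here is a rough back-of-the-envelope sketch in Python. It only restates the numbers from my run; the function name is made up for illustration, and I'm assuming one byte per URL character and 1 MB = 1,000,000 bytes. It says nothing about how htdig actually lays out its data.

```python
# Rough estimate only: compares raw URL text against the observed process size.
# Assumptions (not from htdig itself): 1 byte per character, 1 MB = 1,000,000 bytes.

def raw_url_bytes(num_urls, avg_url_len):
    """Bytes needed to hold the URL text alone, with no per-entry overhead."""
    return num_urls * avg_url_len

NUM_URLS = 250_000               # documents dug in the run described above
AVG_URL_LEN = 80                 # approximate average URL length in characters
OBSERVED_BYTES = 180 * 1_000_000  # approximate peak size of the htdig process

raw = raw_url_bytes(NUM_URLS, AVG_URL_LEN)
unexplained = OBSERVED_BYTES - raw
per_url = OBSERVED_BYTES / NUM_URLS

print(f"raw URL text:     {raw / 1_000_000:.0f} MB")
print(f"unexplained:      {unexplained / 1_000_000:.0f} MB")
print(f"observed per URL: {per_url:.0f} bytes")
```

In other words, the process is using something like nine times the space of the URL text itself, so either each URL carries a lot of bookkeeping or something else is growing.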

        If it really is the URLs eating memory, perhaps we need a patch that
stores the URLs still waiting to be swept in a different way (perhaps each
depth could write the URLs for the next greater 'depth' into a file?). It'd
be very convenient for me to be able to dig 400,000 URLs in a pass.
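To show what I mean by the per-depth file idea, here is a minimal sketch in Python. This is not htdig code and not a patch; `crawl_by_depth` and `fetch_links` are hypothetical names, and `fetch_links(url)` stands in for whatever retrieves a document and returns the links found in it. The pending URLs for each depth live in a file rather than in memory, and the only thing held in memory per URL is an 8-byte digest used to skip duplicates (which does carry a tiny chance of a collision dropping a URL).

```python
import hashlib
import os


def crawl_by_depth(start_urls, fetch_links, max_depth, work_dir):
    """Breadth-first sweep that keeps each depth's pending URLs on disk.

    Returns the number of URLs visited. fetch_links is a caller-supplied
    (hypothetical) function mapping a URL to an iterable of linked URLs.
    """
    seen = set()  # 8-byte digests of URLs already queued, not the URLs themselves

    def key(url):
        return hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()

    # Seed the depth-0 file with the starting URLs.
    current = os.path.join(work_dir, "depth_0.txt")
    with open(current, "w", encoding="utf-8") as out:
        for url in start_urls:
            k = key(url)
            if k not in seen:
                seen.add(k)
                out.write(url + "\n")

    visited = 0
    for depth in range(max_depth + 1):
        nxt = os.path.join(work_dir, f"depth_{depth + 1}.txt")
        with open(current, encoding="utf-8") as src, \
                open(nxt, "w", encoding="utf-8") as out:
            for line in src:
                url = line.rstrip("\n")
                visited += 1
                links = fetch_links(url)
                if depth == max_depth:
                    continue  # deepest level: visit the page, do not queue its links
                for link in links:
                    k = key(link)
                    if k not in seen:
                        seen.add(k)
                        out.write(link + "\n")
        os.remove(current)  # this depth is finished, so its file can go
        current = nxt
    os.remove(current)
    return visited
```

With something like this, memory for the pending list would stay at a few bytes per URL no matter how long the URLs are, at the cost of some extra disk I/O per depth. Whether that fits the way htdig is structured internally, I can't say.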

        If it's not the URLs, what is it?



This archive was generated by hypermail 2b28 : Fri Jan 07 2000 - 11:59:19 PST