Re: [htdig3-dev] htdig-3.2.0b1 slower than htdig-3.1.3 ?

From: Geoff Hutchison
Date: Tue Feb 22 2000 - 14:19:45 PST

At 12:14 PM -0500 2/22/00, Walter Addison March wrote:
>When we run the 3.1.3 thus: htdig -i -l -t it runs from Wed Feb 16
>11:38:58 EST 2000 until Wed Feb 16 13:05:48 EST 2000.
>When I run the 3.2.0b1 thus: htdig -i -t -a -v well... it started at 9am
>and it still isn't done 3 hours later.
>The 3.2 htdig actually should be finding even fewer urls to follow (the
>limit_urls_to list for the 3.2 is shorter than the 3.1.3) and pages to
>dig... any ideas on why it is already taking twice as long and it isn't
>near done?

I would not be surprised if 3.2.0b1 is slower than the 3.1.x versions
for many people. First off, it's essentially doing the work of htdig
and htmerge in one step, so you don't need to do any separate sorting
in 3.2. For right now, you'll still want to run htmerge though, since
it weeds out bogus URLs and so on.

Secondly, the indexing in the 3.2 code is a bit more I/O intensive.
For one, the word database will probably come out a bit larger
because it's storing a record for every single occurrence of a word,
rather than one record for every document that has a certain word.
For another, it splits the excerpts out into another database, which
means it's writing to a few files at once.
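
As a rough illustration of why the word database grows, here is a toy
sketch in Python. It is not ht://Dig's actual on-disk format; the
function names and the two sample documents are made up for the
example. It only contrasts keeping one record per (word, document)
pair with keeping one record per word occurrence.

```python
def per_document_records(docs):
    """One record per (word, document) pair, holding a count.

    Simplified stand-in for the older layout (an assumption, not the
    real 3.1.x record format).
    """
    records = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            key = (word, doc_id)
            records[key] = records.get(key, 0) + 1
    return records


def per_occurrence_records(docs):
    """One record per word occurrence, holding its position.

    Simplified stand-in for the newer layout (an assumption, not the
    real 3.2 record format).
    """
    records = []
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            records.append((word, doc_id, position))
    return records


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "the quick fox jumps over the lazy dog",
    2: "the dog sleeps and the fox runs",
}

by_document = per_document_records(docs)
by_occurrence = per_occurrence_records(docs)

print(len(by_document), "records with one entry per word per document")
print(len(by_occurrence), "records with one entry per occurrence")
```

Any word that repeats within a document adds records only in the
second layout, so the gap widens as documents get longer.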

Finally, we haven't made much effort to optimize for speed--I think
it can be faster, but without some feedback, it's hard to know what
the slow parts are.

In short, don't worry, but any feedback as to performance is most welcome.


This archive was generated by hypermail 2b28 : Tue Feb 22 2000 - 14:24:41 PST