Re: htdig: performance tips?

Geoff Hutchison
Thu, 3 Dec 1998 00:00:51 -0500

>I must strongly disagree. O(n) would be quite helpful. I'd use them
>when planning hardware capacity and making time vs space tradeoffs in
>configuration. It affects decisions such as:

>* whether to leave .work files around.
>* whether to use -a, -i
>* judging how close I am to resource limits
>* deciding whether to buy more storage or more CPU power or more bandwidth
>* deciding whether to manually tell htdig about new pages
> or letting it look for them itself.

I'm not so sure. I think the constants are going to affect this
significantly. So for example, I'd say digging is always O(n) in time,
where n is the number of pages to be retrieved. But if I use incremental
indexing, n can be *much* smaller. And I might say htmerge is something
like O(n)+O(m), where n is the number of documents and m is the number of
words (after all, it goes through all of them). But this hides the fact
that htmerge sorts the words (a comparison sort alone is O(m log m)) and
then goes through them one-by-one and adds them to db.words.db. And the
disk space for htmerge is probably also O(n)+O(m), but I haven't the
faintest whether it's 6*n + 8*m or 2*n + 3*m, and this would make a big
difference in your decisions, right?

Benchmarking should be more useful, since it lets you estimate these
constants. But you'd have to think about your noise: network traffic, other
processes, memory v. virtual memory, and local filesystem v. HTTP v. NFS.
The size of the database might also play a part, since the trade-offs shift
with it. I can believe that on a small database the bookkeeping involved in
update digs (marking docs as modified, having htmerge remove the old doc,
etc.) might not be worth it.
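
One way to estimate such constants from benchmark runs is an ordinary
least-squares fit of t = a*n + b*m, where n is the number of documents and m
the number of words. The sketch below is only an illustration: the function
name is made up, the sample timings are synthetic (generated from a = 2.0,
b = 0.1, with no noise), and real runs would need several repetitions to
average out the noise sources listed above.

```python
def fit_constants(samples):
    """Least-squares fit of t = a*n + b*m (no intercept term).

    samples: iterable of (n, m, t) tuples, one per benchmark run.
    Returns the pair (a, b).
    """
    snn = snm = smm = snt = smt = 0.0
    for n, m, t in samples:
        snn += n * n
        snm += n * m
        smm += m * m
        snt += n * t
        smt += m * t

    # Solve the 2x2 normal equations by Cramer's rule.
    det = snn * smm - snm * snm
    if det == 0:
        raise ValueError("runs must vary n and m independently")
    a = (snt * smm - smt * snm) / det
    b = (smt * snn - snt * snm) / det
    return a, b


if __name__ == "__main__":
    # Synthetic (n, m, seconds) data, not real ht://Dig measurements.
    runs = [(10, 100, 30.0), (50, 100, 110.0),
            (10, 500, 70.0), (200, 1000, 500.0)]
    a, b = fit_constants(runs)
    print("a = %.3f per document, b = %.3f per word" % (a, b))
```

The runs have to vary n and m independently; if every run has the same
ratio of words to documents, the two constants cannot be separated.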

But these effects don't show up well in O(n). Using update digs is probably
something like:
c1*n + c2*m
where n is the number of changed pages and m is the number of unchanged
pages. For this example, c2 << c1.
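
As a rough sketch of what that model implies, the Python below compares an
update dig (c1*n + c2*m) with a full dig. The values of c1 and c2, the page
counts, and the assumption that a full dig costs c1 for every page are all
hypothetical, picked only to show the shape of the trade-off.

```python
def update_dig_cost(changed, unchanged, c1, c2):
    """Cost model from the text: c1*n + c2*m."""
    return c1 * changed + c2 * unchanged


def full_dig_cost(changed, unchanged, c1):
    """Assumption: a full dig pays c1 for every page, changed or not."""
    return c1 * (changed + unchanged)


if __name__ == "__main__":
    c1, c2 = 1.0, 0.01          # hypothetical constants with c2 << c1
    changed, unchanged = 100, 9_900

    update = update_dig_cost(changed, unchanged, c1, c2)
    full = full_dig_cost(changed, unchanged, c1)
    print("update dig: %.1f" % update)
    print("full dig:   %.1f" % full)
    print("speedup:    %.1fx" % (full / update))
```

Both digs are linear in the total number of pages, so big-O notation cannot
tell them apart; the difference is entirely in the constants.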

As for resource limits, I suggest monitoring them yourself, or running a
monitoring tool along with ht://Dig that reports memory, disk space, load,

-Geoff Hutchison
Williams Students Online

