Re: htdig: performance tips?

Jeff Breidenbach (
Thu, 3 Dec 1998 01:15:25 -0500

Hi Geoff,

I think you are underestimating my ignorance. For example...

>So for example, I'd say digging is always O[n] in time, where n is the
>number of pages to be retrieved.

Ok, I expressed n as the total number of pages, but this is a matter
of notation and I think I can convert.

>But if I use incremental indexing, n can be *much* smaller.

I interpret this as saying:
 if one page has changed, and n=total number of pages, then
 htdig bandwidth = O[constant]
 htdig time = O[constant]

That surpises me - I'd expect bandwidth to still be O[n], albeit with
a smaller constant. Glad to hear cpu time is O[constant].

>And I might say htmerge is something like O[n]+O[m] where n is the
>number of documents and m is the number of words (after all, it goes
>through all of them).

m = number of words in the page, or number of words in the database?
Probably in either case I can assume m=constant. (If n is large where
n=total number of pages.)

>But this hides the fact that htmerge sorts the words and then goes
>through them one-by-one and adds them to the db.words.db.

>From a scalability perspective, I'm not sure why I care about
implementation details as long as I understand their costs.

>And the disk space on htmerge is probably also O[n]+O[m], but I
>haven't the faintest whether it's 6*n + 8*m or 2*n + 3*m, and this
>would make a big difference in your decisions, right?

Thank god! Constants can make a difference if they dominate, but I
care immensely more that you didn't say O[n^2] or worse.

Actually, I'm not sure exactly which disk space you are referring to
here. Is this peak disk space or final disk space? Is n the total
number of documents or the total number of changed documents?

>I can believe on a small database that the bookkeeping with doing
>update digs (marking docs as modified, having htmerge remove the old
>doc, etc.) might not be worth it.

Absolutely. This only affects those of us blessed with scaling up.

>Using update digs is probably something like: c1*n + c2*m where n are
>the number of changed pages and m are the number of unchanged
>pages. For this example, c2 << c1.

In cpu time? In final disk space? In peak disk space? In bandwidth?

>As for resource limits, I suggest monitoring it yourself, or running a
>monitoring tool along with ht://Dig that reports memory, disk space,
>load, etc.

I do. But I'd love to know that something is O[n^2] and will dominate
my resource costs when things get 10x or 100x bigger. (So I can change
my system design to focus on fighting that one terrible thing.)


To unsubscribe from the htdig mailing list, send a message to containing the single word "unsubscribe" in
the body of the message.

This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:29:45 PST