Jeff Breidenbach (email@example.com)
Thu, 3 Dec 1998 01:15:25 -0500
I think you are underestimating my ignorance. For example...
>So for example, I'd say digging is always O[n] in time, where n is the
>number of pages to be retrieved.
Ok, I expressed n as the total number of pages, but this is a matter
of notation and I think I can convert.
>But if I use incremental indexing, n can be *much* smaller.
I interpret this as saying:
if one page has changed, and n=total number of pages, then
htdig bandwidth = O[constant]
htdig time = O[constant]
That surprises me - I'd expect bandwidth to still be O[n], albeit with
a smaller constant. Glad to hear CPU time is O[constant].
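To make my expectation concrete, here is a toy cost model for full vs. incremental digging. Everything here is an assumption on my part - the byte counts, and the idea that an incremental dig still does a cheap per-page freshness check (e.g. If-Modified-Since) - none of it is measured htdig behavior.

```python
# Hypothetical bandwidth model; constants are made up for illustration.

def full_dig_bandwidth(total_pages, avg_page_bytes=10_000):
    """Full dig: fetch every page -> O(n) with a large constant."""
    return total_pages * avg_page_bytes

def incremental_dig_bandwidth(total_pages, changed_pages,
                              avg_page_bytes=10_000, head_check_bytes=300):
    """Incremental dig (assumed): a cheap freshness check per page,
    plus a full fetch only for changed pages. Still O(n) overall,
    but with a much smaller constant on the n term."""
    return total_pages * head_check_bytes + changed_pages * avg_page_bytes

print(full_dig_bandwidth(1_000))            # 10,000,000 bytes
print(incremental_dig_bandwidth(1_000, 1))  # 310,000 bytes
```

Under this model, one changed page out of a thousand still costs O[n] bandwidth - just with head_check_bytes instead of avg_page_bytes as the constant.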
>And I might say htmerge is something like O[n]+O[m] where n is the
>number of documents and m is the number of words (after all, it goes
>through all of them).
m = number of words in the page, or number of words in the database?
In either case I can probably assume m=constant (as long as n, the
total number of pages, is large).
>But this hides the fact that htmerge sorts the words and then goes
>through them one-by-one and adds them to the db.words.db.
From a scalability perspective, I'm not sure why I care about
implementation details as long as I understand their costs.
>And the disk space on htmerge is probably also O[n]+O[m], but I
>haven't the faintest whether it's 6*n + 8*m or 2*n + 3*m, and this
>would make a big difference in your decisions, right?
Thank god! Constants can make a difference if they dominate, but I
care immensely more that you didn't say O[n^2] or worse.
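A quick worked example of why I care more about the exponent than the constants. The coefficients here (6, 8, 0.001) are arbitrary, chosen only to show that even a tiny quadratic term eventually swamps a generously large linear one:

```python
# Exponent beats constants: at n = 10^6 pages, a 0.001*n^2 term already
# dwarfs 6*n + 8*m with m ten times larger than n. All numbers invented.

def linear_cost(n, m):
    return 6 * n + 8 * m          # 8.6e7 at the sizes below

def quadratic_cost(n):
    return 0.001 * n * n          # 1e9 at n = 1e6

n, m = 1_000_000, 10_000_000
print(linear_cost(n, m) < quadratic_cost(n))  # True
```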
Actually, I'm not sure exactly which disk space you are referring to
here. Is this peak disk space or final disk space? Is n the total
number of documents or the total number of changed documents?
>I can believe on a small database that the bookkeeping with doing
>update digs (marking docs as modified, having htmerge remove the old
>doc, etc.) might not be worth it.
Absolutely. This only affects those of us blessed with scaling up.
>Using update digs is probably something like: c1*n + c2*m where n are
>the number of changed pages and m are the number of unchanged
>pages. For this example, c2 << c1.
In cpu time? In final disk space? In peak disk space? In bandwidth?
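Restating the quoted model as code, to show where my question comes from. The constants below are placeholders, and the model says nothing about which resource (time, disk, bandwidth) they measure - which is exactly what I'm asking:

```python
# The quoted update-dig model: cost = c1*n + c2*m, where n = changed
# pages, m = unchanged pages, and c2 << c1. Constants are placeholders.

def update_dig_cost(changed, unchanged, c1=100.0, c2=1.0):
    return c1 * changed + c2 * unchanged

# With one changed page out of a million, the cheap c2*m bookkeeping
# term dominates: it sets the floor on every update dig.
print(update_dig_cost(1, 999_999))  # 1000099.0
```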
>As for resource limits, I suggest monitoring it yourself, or running a
>monitoring tool along with ht://Dig that reports memory, disk space,
I do. But I'd love to know that something is O[n^2] and will dominate
my resource costs when things get 10x or 100x bigger. (So I can change
my system design to focus on fighting that one terrible thing.)
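One way to catch that "one terrible thing" empirically, if nobody can state the exponents: measure a resource at a few database sizes and fit the slope of log(cost) against log(size); a slope near 2 means a quadratic component is lurking. The data below is synthetic - substitute real htdig measurements:

```python
import math

def scaling_exponent(sizes, costs):
    """Least-squares slope of log(cost) vs. log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in costs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic measurements with a hidden quadratic cost:
sizes = [1_000, 10_000, 100_000]
costs = [0.001 * s * s for s in sizes]
print(round(scaling_exponent(sizes, costs), 2))  # -> 2.0
```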
To unsubscribe from the htdig mailing list, send a message to
firstname.lastname@example.org containing the single word "unsubscribe" in
the body of the message.
This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:29:45 PST