Re: htdig: performance tips?


Jeff Breidenbach (jeff@jab.org)
Wed, 2 Dec 1998 23:45:35 -0500


Thank you for the suggestion about the .work files. I'll confirm my
Apache 1.3.1 supports the If-Modified-Since header and is configured
correctly in that respect.
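
One way to run that check is a conditional GET against a page that has not changed recently. The following is only a minimal sketch using the Python standard library; the function names are made up for illustration, and the URL and timestamp are whatever you choose to test with. A server that honors the header should answer 304 Not Modified, although dynamically generated pages often return 200 regardless.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def build_conditional_request(url, since_epoch):
    """Build a GET request carrying an If-Modified-Since header.

    since_epoch is a Unix timestamp; HTTP dates are always sent in GMT.
    (Helper name is hypothetical, not part of htdig or Apache.)
    """
    req = urllib.request.Request(url)
    req.add_header("If-Modified-Since", formatdate(since_epoch, usegmt=True))
    return req


def honors_if_modified_since(url, since_epoch):
    """Return True if the server answers 304 Not Modified."""
    try:
        with urllib.request.urlopen(build_conditional_request(url, since_epoch)) as resp:
            # A 2xx response means the full page was sent again.
            return resp.status == 304
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError, so check the code here.
        return err.code == 304
```

Calling honors_if_modified_since() with a timestamp later than the page's last change should return True on a correctly configured server.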

>But I've got a DB of about 50,000, so only indexing daily changes (a
>few hundred pages maybe) is a big speedup.

I've got about 130,000 pages very unevenly split over about 180 htdig
databases and am growing by about 1000 pages per day, and may grow
much faster soon. I didn't realize that removing the .work files was
killing incremental indexing - I'll check how much keeping them helps
during a full-scale test.

> [...] but benchmarks will vary considerably, and O(n) isn't useful.

I must strongly disagree. O(n) estimates would be quite helpful. I'd
use them when planning hardware capacity and making time vs space
tradeoffs in configuration. They affect decisions such as:

* whether to leave .work files around.
* whether to use -a, -i
* judging how close I am to resource limits
* deciding whether to buy more storage or more CPU power or more bandwidth
* deciding whether to manually tell htdig about new pages
  or letting it look for them itself.

These decisions would be much easier if I could benchmark my current
setup's performance and had O(n) performance estimates. It's the
difference between guessing at bottlenecks and making more intelligent
decisions.
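
As a rough illustration of how benchmarks and scaling estimates could be combined, the sketch below fits time = c * n^k to a handful of measured runs by least squares on a log-log scale. Everything in it is an assumption for illustration: the function names are invented, and the measurements would have to come from timing real htdig runs on databases of different sizes. A fitted k near 1 suggests roughly linear scaling, though a power-law fit cannot tell O(n log n) apart from a slightly larger exponent.

```python
import math


def fit_power_law(sizes, times):
    """Fit time = c * n**k by least squares on log-log data.

    sizes: page counts of the benchmarked databases
    times: measured run times for those databases (same order)
    Returns (c, k).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx                      # slope = scaling exponent
    c = math.exp(mean_y - k * mean_x)  # intercept = constant factor
    return c, k


def predict_time(c, k, n):
    """Extrapolate the fitted curve to a database of n pages."""
    return c * n ** k
```

With at least three runs at well-separated sizes, predict_time() gives a first guess at how long a run would take at, say, twice the current page count.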

I set up a table below which would probably be enough to answer just
about any performance / scaling question. I filled in the few values
that seem obvious to me, and would love to know the rest. Any takers?

Jeff

                     Performance estimates, htdig
                 n = total number of messages indexed

============================================================================
               Time      RAM       Disk      Disk      Bandwidth
                                   (final)   (peak)
----------------------------------------------------------------------------
Initial indexing
----------------------------------------------------------------------------
    htdig      ???       ???       ???       ???       O(n)
    ------------------------------------------------------------------------
    htmerge    ???       ???       ???       ???       0
----------------------------------------------------------------------------
Second indexing (no changes to data)
----------------------------------------------------------------------------
    htdig      ???       ???       ???       ???       ???
    ------------------------------------------------------------------------
    htmerge    ???       ???       ???       ???       0
----------------------------------------------------------------------------
Third indexing (one piece of data has changed)
----------------------------------------------------------------------------
    htdig      ???       ???       ???       ???       ???
    ------------------------------------------------------------------------
    htmerge    ???       ???       ???       ???       0
----------------------------------------------------------------------------
Fourth indexing (all data has changed)
----------------------------------------------------------------------------
    htdig      ???       ???       ???       ???       ???
    ------------------------------------------------------------------------
    htmerge    ???       ???       ???       ???       0
----------------------------------------------------------------------------

Additional Notes:
  Data above assume no use of -i or -a.

  -i will make everything perform like initial indexing.

  -a will double final disk requirements, plus add O(n) time
  for copying .work files (probably with a low constant).
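
To show how these notes could feed a capacity estimate, here is a small sketch that works out how many days of growth fit on a disk, with and without -a. It assumes disk use grows linearly with the page count, which is exactly one of the unknowns in the table above, and bytes_per_page is a made-up placeholder that would have to be measured. The function name is hypothetical too.

```python
def days_until_full(pages_now, pages_per_day, bytes_per_page,
                    disk_bytes, use_a):
    """Days of growth before the databases no longer fit on disk.

    ASSUMPTION: disk use is linear in the number of pages.
    bytes_per_page is a placeholder to be measured, not a known figure.
    use_a=True doubles the requirement, per the note on -a above.
    """
    per_page = bytes_per_page * (2 if use_a else 1)
    max_pages = disk_bytes // per_page
    remaining = max_pages - pages_now
    # Already over the limit counts as zero days left.
    return max(0, remaining // pages_per_day)
```

Running it for both values of use_a with the same inputs shows how much headroom the doubled disk requirement of -a costs.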



This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:29:45 PST