Re: htdig: performance tips?


Maren S. Leizaola (leizaola@unitedmta.com)
Thu, 3 Dec 1998 13:10:07 +0800 (CST)


On Wed, 2 Dec 1998, Jeff Breidenbach wrote:

>
> Thank you for the suggestion about the .work files. I'll confirm my
> Apache 1.3.1 supports the If-Modified-Since header and is configured
> correctly in that respect.
>
> >But I've got a DB of about 50,000, so only indexing daily changes (a
> >few hundred pages maybe) is a big speedup.
>
> I've got about 130,000 pages very unevenly split over about 180 htdig
> databases and am growing by about 1000 pages per day, and may grow
> much faster soon. I didn't realize that removing the .work files was
> killing incremental indexing - I'll check how it helps during a full
> scale test.
>

Unfortunately, incremental indexing only works well on sites which know how
to archive their data, or which have a good way of rolling their daily
updated pages into the archives.
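
The saving, when it does apply, comes from the conditional GET that an
incremental dig can issue when it still has its databases from the last run.
Just to illustrate the HTTP side of it (the host, path and timestamp below
are made-up placeholders; htdig does the equivalent internally):

    # Illustration only: the conditional GET that makes an incremental dig
    # cheap when pages have not changed.  Host, path and timestamp are
    # hypothetical placeholders.
    import http.client

    HOST = "www.example.com"
    PATH = "/archive/page.html"
    LAST_DIG = "Wed, 02 Dec 1998 05:00:00 GMT"   # time of the previous dig

    conn = http.client.HTTPConnection(HOST)
    conn.request("GET", PATH, headers={"If-Modified-Since": LAST_DIG})
    resp = conn.getresponse()

    if resp.status == 304:
        # 304 Not Modified: nothing to fetch, nothing to reindex.
        print("unchanged - skip")
    else:
        # 200 OK (or a server that ignores the header): the whole body
        # comes back and the page has to be parsed and reindexed.
        body = resp.read()
        print("fetched %d bytes - reindex" % len(body))
    conn.close()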

> > [...] but benchmarks will vary considerably, and O(n) isn't useful.
>
> I must strongly disagree. O(n) would be quite helpful. I'd use them
> when planning hardware capacity and making time vs space tradeoffs in
> configuration. It affects decisions such as:
>
> * whether to leave .work files around.
> * whether to use -a, -i
> * judging how close I am to resource limits
> * deciding whether to buy more storage or more CPU power or more bandwidth
> * deciding whether to manually tell htdig about new pages
> or letting it look for them itself.
>
> These decisions would be much easier if I could benchmark my current
> setup's performance and had O(n) performance estimates. It's the
> difference between guessing at bottlenecks and making more intelligent
> decisions.
>
> I set up a table below which would probably be enough to answer just
> about any performance / scaling question. I filled in the very few values
> that seem obvious to me, and would love to know the rest. Any takers?
>

Jeff, you are also forgetting a few factors which really have to be
thrown into the equation:

- how many parallel copies of HTDig you are running (which affects inbound
  bandwidth consumption; a rough sketch of this follows the list).
- what kind of sites you are indexing and whether they are proxy friendly.
- how much bandwidth the remote site has.
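
The first factor is easy to put a rough number on. A back-of-envelope
sketch, where every figure is a made-up placeholder you would replace with
your own measurements:

    # Back-of-envelope inbound bandwidth for parallel digs.
    # Every number below is a placeholder, not a measurement.
    parallel_digs   = 4      # concurrent copies of htdig
    avg_page_kbytes = 10.0   # average size of a fetched page
    avg_fetch_secs  = 2.0    # request + transfer time per page, which is
                             # where the remote site's bandwidth and
                             # proxy-friendliness show up

    pages_per_sec  = parallel_digs / avg_fetch_secs
    kbytes_per_sec = pages_per_sec * avg_page_kbytes
    print("~%.1f pages/s, ~%.0f KB/s (~%.0f kbit/s) inbound"
          % (pages_per_sec, kbytes_per_sec, kbytes_per_sec * 8))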

If you are also measuring your outbound (search) performance, there are more
questions (a rough timing sketch follows the list):

- What kind of queries are they: worst-case searches or simple unique queries?
- What is the average number of disk transactions per search?
- What is the maximum number of disk transactions your disk subsystem will
  handle?
- How many CPU cycles does a search take, in simple and worst-case scenarios?
- etc...
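
You can get first-order numbers for these by timing htsearch through your
web server. A minimal sketch, assuming a hypothetical search URL of the form
http://your.host/cgi-bin/htsearch and a few placeholder queries (pick ones
you consider simple and worst-case for your own data):

    # Time a few searches through the htsearch CGI to estimate cost per
    # search.  The URL and the query words are placeholder assumptions.
    import time
    import urllib.parse
    import urllib.request

    SEARCH_URL = "http://your.host/cgi-bin/htsearch"      # hypothetical
    QUERIES = ["unsubscribe", "apache modified since", "the"]

    for words in QUERIES:
        url = SEARCH_URL + "?" + urllib.parse.urlencode({"words": words})
        start = time.time()
        body = urllib.request.urlopen(url).read()
        elapsed = time.time() - start
        print("%-25s %6.2f s  %7d bytes  (~%.1f searches/s serialized)"
              % (words, elapsed, len(body), 1.0 / max(elapsed, 0.001)))

While the loop runs, watch iostat/vmstat or top to see roughly how many disk
transactions and how much CPU each kind of search costs.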

Then work out the bottleneck on your machine: CPU, disk I/O, or bandwidth.

Once you have the maximum number of searches you can take per second,
multiply that by the number of bytes a worst-case page full of results
takes; that is the outbound bandwidth you need.
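
As a worked example of that multiplication (both numbers are invented, just
to show the shape of the calculation):

    # Outbound bandwidth at saturation: searches/sec times the size of a
    # worst-case results page.  Both figures are invented placeholders.
    max_searches_per_sec  = 20
    worst_case_page_bytes = 40000

    outbound = max_searches_per_sec * worst_case_page_bytes   # bytes/sec
    print("~%.0f KB/s (~%.2f Mbit/s) outbound at saturation"
          % (outbound / 1024.0, outbound * 8 / 1e6))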

In our case we have yet to max out, but I think it will be disk I/O when it
happens. In our case multiple CPUs do not help much...
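
Even without anyone filling in the table below, you can get usable planning
figures by benchmarking one full dig at your current size and extrapolating,
on the assumption that a full dig/merge is roughly linear in n. A sketch of
that extrapolation, with placeholder figures standing in for your own
benchmark:

    # Linear extrapolation of one measured full dig to a planned database
    # size, assuming htdig/htmerge scale roughly O(n).  The measured values
    # are placeholders for your own benchmark.
    measured_pages       = 130000    # size of the benchmarked run
    measured_dig_hours   = 6.0       # wall-clock time of that run
    measured_disk_mbytes = 900.0     # final database size on disk

    planned_pages = 200000
    scale = planned_pages / float(measured_pages)

    print("projected dig time : %.1f hours" % (measured_dig_hours * scale))
    print("projected disk use : %.0f MB (double it if -a keeps .work copies)"
          % (measured_disk_mbytes * scale))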

Maren.

> Jeff
>
>
> Performance estimates, htdig
> n = total number of messages indexed
>
> ============================================================================
>            Time      RAM       Disk      Disk      Bandwidth
>                                (final)   (peak)
> ----------------------------------------------------------------------------
> Initial indexing
> ----------------------------------------------------------------------------
> htdig      ???       ???       ???       ???       O(n)
> htmerge    ???       ???       ???       ???       0
> ----------------------------------------------------------------------------
> Second indexing (no changes to data)
> ----------------------------------------------------------------------------
> htdig      ???       ???       ???       ???       ???
> htmerge    ???       ???       ???       ???       0
> ----------------------------------------------------------------------------
> Third indexing (one piece of data has changed)
> ----------------------------------------------------------------------------
> htdig      ???       ???       ???       ???       ???
> htmerge    ???       ???       ???       ???       0
> ----------------------------------------------------------------------------
> Fourth indexing (all data has changed)
> ----------------------------------------------------------------------------
> htdig      ???       ???       ???       ???       ???
> htmerge    ???       ???       ???       ???       0
> ----------------------------------------------------------------------------
>
>
> Additional Notes:
> Data above assume no use of -i or -a
>
> -i will make everything perform like initial indexing
>
> -a will double final disk requirements, plus add O(n) time
> for copying .work files (probably with a low constant).

----------------------------------------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig-request@sdsu.edu containing the single word "unsubscribe" in
the body of the message.


