Re: [htdig] ht://Dig 3.2.0b1 and 3.2.0b2-022000 Extremely Slooooow


Subject: Re: [htdig] ht://Dig 3.2.0b1 and 3.2.0b2-022000 Extremely Slooooow
From: J Kinsley (jkinsley@horus.bticc.net)
Date: Wed Feb 23 2000 - 14:20:03 PST


On Wed, 23 Feb 2000, Geoff Hutchison wrote:

> Date: Wed, 23 Feb 2000 08:31:22 -0600
> From: Geoff Hutchison <ghutchis@wso.williams.edu>
> To: J Kinsley <jkinsley@horus.bticc.net>
> Cc: htdig@htdig.org
> Subject: Re: [htdig] ht://Dig 3.2.0b1 and 3.2.0b2-022000 Extremely Slooooow

Ok, I shall attempt to provide some hard numbers here showing the
index speed difference between 3.2.0b2-022000 and 3.1.2. First,
though, I will clear up the RPM confusion. When I installed the beta
series, I used RPM to build and install it. However, last spring
when I first installed 3.1.2, I did not use RPM and the binaries went
into /opt/www/bin. When installing the beta RPM, I moved the binary
location to /opt/www/sbin, and since 3.1.2 was manually installed, RPM
did not remove those binaries. The first time I built the index two
days ago, I called htdig from the command line and the 3.1.2 binaries
were used instead of the betas. I did not realize this until I tried
to determine why htsearch (whose 3.1.2 version had been overwritten
by the beta version) failed to recognize the database. Although I had
previously installed ht://Dig, I had never used it due to disk space
limitations.
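
A mix-up like this (two installs, with the older one found first on the
search path) can be caught by listing every matching binary in lookup
order. A minimal sketch in Python; the function name find_on_path is
made up for illustration and is not part of ht://Dig:

```python
import os

def find_on_path(name, path=None):
    """Return every executable called `name` on the search path,
    in the order a shell would try them."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            # Empty entries are skipped here (a shell would treat
            # them as the current directory).
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits

if __name__ == "__main__":
    # The first line printed is the binary a bare "htdig" would run.
    for hit in find_on_path("htdig"):
        print(hit)
```

If more than one path is printed, calling the binary by its full path
removes the ambiguity.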

Anyway, on with the numbers....

Server:
        Intel PII 233MHz
        64MB SDRAM
        Kernel 2.2.14
        Customized RedHat 6.0-6.2
        Apache 1.3.6

Archive:
        44101 Files - 1290 Directories
                Smallest: 190 B
                Largest: 9.40 MB
                Average: 30.20 KB
                Total: 1.35 GB

NOTE: ht://Dig is running on the same physical host as the web server
it is indexing, so network bandwidth is not a factor here.

ht://Dig version: 3.1.2

        htdig -l -s -v -c /etc/www/htdig/bti.conf > /tmp/htdig.log 2>&1

        Index time: 01:52:00
        Index size: 634MB wordlist
                     325MB documents
        URLs indexed according to /tmp/htdig.log: 52100
        (number higher than total due to indexing ?[MNSD]=[AD] for
        each directory)

        CPU time: 00:39:00
        RSS: unknown

        htmerge -c /etc/www/htdig/bti.conf
        Merge time: 00:42:00
        Index size: 504MB wordlist.db
        CPU time: unknown
        RSS: unknown

        Note: the above numbers are from my memory and thus are
        close approximations.

ht://Dig version: 3.2.0b1

        htdig -l -s -v -c /etc/www/htdig/bti.conf > /tmp/htdig.log 2>&1

        Stopped after 3 hours / ~2200 files to try to speed things up

ht://Dig version: 3.2.0b2-022000

        htdig -l -s -v -c /etc/www/htdig/bti.conf > /tmp/htdig.log 2>&1

        Index time:
                Start: Feb 23 05:07:18 EST 2000
                Current: Feb 23 16:38:25 EST 2000
                Est. End: Feb 24 10:00:00 EST 2000
        URLs processed according to /tmp/htdig.log: 19111
        CPU time: 00:52:42
        RSS: 31MB
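
The start time, current time and URL count above are enough to work out
an average rate and a rough finish time. A sketch of that arithmetic in
Python, assuming the run ends at the same 52100 URLs that 3.1.2 logged
(a figure quoted from memory) and that the per-URL cost stays constant;
the function names are illustrative only:

```python
from datetime import datetime

def urls_per_hour(start, now, done):
    """Average throughput so far, in URLs per hour."""
    hours = (now - start).total_seconds() / 3600
    return done / hours

def projected_end(start, now, done, total):
    """Linear extrapolation: assume the remaining URLs cost the same
    average time per URL as the ones already processed."""
    per_url = (now - start) / done          # timedelta per URL
    return now + per_url * (total - done)

if __name__ == "__main__":
    # Figures from the 3.2.0b2-022000 run above.
    start = datetime(2000, 2, 23, 5, 7, 18)
    now = datetime(2000, 2, 23, 16, 38, 25)
    done = 19111
    total = 52100  # assumption: same URL count as the 3.1.2 run

    print("URLs/hour so far:", round(urls_per_hour(start, now, done)))
    print("Projected end:   ", projected_end(start, now, done, total))
```

A straight-line projection like this is only as good as the constant-rate
assumption; if the indexer slows down as it runs, the real finish will be
later than the projected one.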

<snip>

> Now, as far as the speed of indexing in 3.2.0b1 (and current
> snapshots), I probably need to make this a FAQ. Right now, it's
> probably not going to be faster than 3.1.x versions and is quite
> likely to be slow. We rewrote the whole layout of databases and in
> the process made quite a few trade-offs against the indexer.

Using my estimated end time above, we're looking at a 27-hour
increase in index time on ~50,000 URLs. I do not think this is what
you mean by 'a few trade-offs', so I am guessing it is a bug. Although
I do not fully understand how to detect memory leaks, I suspect that
is the problem. When I first start htdig, it indexes the first 1000
URLs in about 6 minutes, the RSS creeps up to around 18-19MB, and
then it starts to slow down.
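
One way to put numbers on the RSS creep is to sample the process's
resident set size at intervals while it runs. A minimal sketch in
Python, assuming a Linux /proc filesystem that exposes a VmRSS line in
/proc/<pid>/status; the function names are illustrative and not part of
ht://Dig:

```python
import re
import time

def parse_vmrss_kb(status_text):
    """Extract VmRSS (in kB) from the text of /proc/<pid>/status.
    Returns None if the line is absent."""
    m = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else None

def growth_kb_per_minute(samples):
    """samples: list of (seconds_since_start, rss_kb) pairs.
    Returns the average growth between first and last sample."""
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (r1 - r0) / ((t1 - t0) / 60)

def sample_rss(pid, interval=60, count=10):
    """Read VmRSS for `pid` every `interval` seconds, `count` times.
    Stops early if the process goes away."""
    samples = []
    begin = time.monotonic()
    for _ in range(count):
        try:
            with open("/proc/%d/status" % pid) as f:
                rss = parse_vmrss_kb(f.read())
        except (FileNotFoundError, ProcessLookupError):
            break  # process has exited
        if rss is not None:
            samples.append((time.monotonic() - begin, rss))
        time.sleep(interval)
    return samples
```

Steady growth over a long run is consistent with a leak but does not
prove one, since legitimate caching would look the same from outside.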

<snip>

> But the important thing to remember is that these are *betas*--we're
> looking for feedback. We'd love to have accurate performance and
> requirement feedback. The new database layout is probably going to
> require more disk space (especially if compression is off), but you
> won't need as much memory for htmerge. So hard numbers would be
> wonderful. This will help us target what needs improvement. Further,
> if anyone wants to help improve indexing performance, I'm sure we can
> come up with a list.

ht://Dig is just one of many bleeding edge packages I currently
have installed, so I'll do what I can to help solve the problems.

J. Kinsley




This archive was generated by hypermail 2b28 : Wed Feb 23 2000 - 14:29:40 PST