Re: [htdig] performance


Subject: Re: [htdig] performance
From: Zoran Constantinescu-Fulop (Zoran.Constantinescu-Fulop@idi.ntnu.no)
Date: Mon Apr 10 2000 - 10:05:51 PDT


Hi!

> >anyone have some performance # of the default htdig vs the one with
> >the mysql patch?
> I would guess that since no one has spoken up, the answer is no.
>
I don't have any numbers, just some tests I made. For a small
number of indexed web pages (thousands), htdig with file
storage is faster at indexing. The problem comes when you have
some hundreds of thousands of web pages to index. Then
the indexing procedure takes a _very_ long time (it feels like
exponential growth). You notice this especially when you
have a slow machine (P133 w/16MB RAM :). I tried to index about
10,000 pages (approx. 100 MB of data) with the old htdig, but I
had to give up after 2 hours: the most active part of the
computer was the hard disk :-).

The mysql version deals better with such a large amount of data.
I personally tried it with about 300,000 indexed web pages (more
than 1 GB of data) and it is faster. Of course, this depends on
a lot of things (size of web pages, number of words, etc.).

> Since the patch doesn't explicitly use any advanced SQL queries, my
> guess is that it won't change things substantially. However, I
> haven't tried it, so I obviously can't say. :-)
>
As you say, the SQL queries are not at all advanced; I would
rather call them 'unoptimized'. However, an improvement in
speed over the gdbm-htdig can be seen. The sql version uses
(by default) the SQL server's caching and indexing.
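
To illustrate what "the server's indexing" means for a word lookup,
here is a minimal sketch. It is not taken from the patch: the table
`words`, its columns, and the index name `idx_word` are made up, and
Python's built-in sqlite3 stands in for a MySQL server so the example
is self-contained. It asks the database for its query plan with and
without an index on the word column.

```python
import sqlite3

def word_lookup_plan(use_index):
    """Return the query plan text for a word lookup.

    Hypothetical schema; SQLite is only a stand-in for MySQL here.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT, doc_id INTEGER)")
    if use_index:
        # With an index, the lookup no longer scans the whole table.
        conn.execute("CREATE INDEX idx_word ON words (word)")
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT doc_id FROM words WHERE word = ?",
        ("search",),
    ).fetchall()
    conn.close()
    # The last column of each plan row is the human-readable detail.
    return " ".join(str(row[-1]) for row in rows)

if __name__ == "__main__":
    print("without index:", word_lookup_plan(False))
    print("with index:   ", word_lookup_plan(True))
```

The exact plan wording differs between database versions, but the
indexed variant should mention the index while the other one reports
a full table scan.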

> Will it change functionality in any way? Well, it will require you to
> have a working SQL server up. Some people already have one running,
> so that's no problem. Other people don't, so it would be one more
> requirement.
>
I tried to make the patch as easy as possible to install and to
use with the MySQL server (or others), even for a non-SQL user.

Of course the patch does not provide _all_ the functionality of
the gdbm version of htdig. I still have some work to do on
htfuzzy (it works now only with the 'endings' algorithm), and
probably some bugs are still around :-) And, of course, it works
only with 3.1.5.

There are also some advantages to the SQL version. You can
keep the database and just update some of the web pages.
Deleting one of the web pages from the database is just an
easy sql command. I don't know how this is done in the gdbm-htdig.
(I have to admit :-( that this feature is not completely
implemented in the patch yet.) And I think there could be
some other advantages to using sql.
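
As a rough illustration of the delete case (again, not the patch's
actual schema: the tables `documents` and `words` and the helper
names are hypothetical, and sqlite3 stands in for MySQL), removing
one page could look like this. With this made-up two-table layout it
takes two DELETE statements, run in one transaction.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a hypothetical layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE documents (id INTEGER PRIMARY KEY, url TEXT UNIQUE);
        CREATE TABLE words (word TEXT, doc_id INTEGER);
    """)
    conn.executemany(
        "INSERT INTO documents VALUES (?, ?)",
        [(1, "http://example.com/a.html"), (2, "http://example.com/b.html")],
    )
    conn.executemany(
        "INSERT INTO words VALUES (?, ?)",
        [("search", 1), ("engine", 1), ("search", 2)],
    )
    conn.commit()
    return conn

def delete_page(conn, url):
    """Remove one page and its word entries; return 1 if removed, else 0."""
    row = conn.execute(
        "SELECT id FROM documents WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return 0
    with conn:  # commits both deletes together, rolls back on error
        conn.execute("DELETE FROM words WHERE doc_id = ?", (row[0],))
        conn.execute("DELETE FROM documents WHERE id = ?", (row[0],))
    return 1

if __name__ == "__main__":
    db = build_demo_db()
    print(delete_page(db, "http://example.com/a.html"))
```

The rest of the index stays in place, so only the changed pages need
to be dug again.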

The problem is that you _have to_ keep an SQL server up and
running. From my experience, I can say that MySQL is a
very good sql server. I didn't have any problems with it.

Cheers,
--zoran

-------------------------------
Zoran Constantinescu -o)
zoran@idi.ntnu.no /\\
http://www.idi.ntnu.no/~zoran _\_v
tel:+47 977 11 574
-------------------------------




This archive was generated by hypermail 2b28 : Mon Apr 10 2000 - 07:50:50 PDT