RE: [htdig] Large DB


Subject: RE: [htdig] Large DB
From: Patrick Dugal (dugal@lynx.cisti.nrc.ca)
Date: Thu Nov 04 1999 - 10:49:10 PST


Good question.

At NRC (http://www.nrc.ca/), we have indexed over 130,000 documents on more
than 50 web servers within our domain (nrc.ca). The indexing runs on a Pentium
II 450 with 256 megs of RAM, Linux kernel-2.0.35-1. The sum of the size of
the databases (db.docdb, db.docs.index, db.wordlist, db.words.db) is about
2.5 Gigabytes (NFS mounted). The temporary space (NFS mounted as well)
needed by htmerge for sorting is very significant (one or two gigabytes, I
think). It takes about two or three days to do an initial dig, and it takes
an afternoon to do an update dig. The search time on this relatively large
database has been very fast (a few seconds).
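
For a rough sense of scale, the figures above can be turned into a
per-document index size and an average dig rate. This is only a
back-of-the-envelope sketch: it assumes decimal gigabytes, takes "two or
three days" as 48 to 72 hours, and assumes that database size grows linearly
with document count, which has not been measured here. The function names and
the 1,000,000-document target are made up for illustration.

```python
# Figures quoted in the message; decimal gigabytes are an assumption.
DOCS_INDEXED = 130_000
DB_SIZE_BYTES = 2.5e9
SECONDS_PER_DAY = 86_400


def bytes_per_document(db_size_bytes, docs):
    """Average on-disk index size per indexed document."""
    return db_size_bytes / docs


def estimate_db_size(docs, per_doc_bytes):
    """Extrapolate database size; linear growth is an unverified assumption."""
    return docs * per_doc_bytes


def docs_per_second(docs, days):
    """Average dig rate over an initial dig lasting the given number of days."""
    return docs / (days * SECONDS_PER_DAY)


per_doc = bytes_per_document(DB_SIZE_BYTES, DOCS_INDEXED)
print(f"Index size per document: {per_doc / 1000:.1f} kB")

# Hypothetical target, chosen only to illustrate the extrapolation.
hypothetical_docs = 1_000_000
projected = estimate_db_size(hypothetical_docs, per_doc)
print(f"Projected size for {hypothetical_docs:,} documents: {projected / 1e9:.1f} GB")

# "Two or three days" for the initial dig gives a range of average rates.
slow = docs_per_second(DOCS_INDEXED, 3)
fast = docs_per_second(DOCS_INDEXED, 2)
print(f"Initial dig rate: {slow:.2f} to {fast:.2f} documents per second")
```

Real numbers will depend heavily on the mix of documents (large PDFs versus
small HTML pages), so treat the result as an order-of-magnitude guide only.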

One of the reasons it takes so long to dig is that huge PDF files (close
to 50 megs sometimes) may take several seconds to convert to text, whereas
HTML files usually take much less time to parse.

Also, and this is the amazing part, we were previously doing the same size
of dig on a little Pentium 100 with 48 megs of RAM and lots of swap space.
It took about twice as long on that one, but the search time on that
machine was also very fast (a few seconds).

Pat :)

-----Original Message-----
From: htdig@htdig.org [mailto:htdig@htdig.org] On Behalf Of Premier
Hosting Administrator
Sent: Thursday, November 04, 1999 12:22 PM
To: htdig@htdig.org
Subject: [htdig] Large DB

We have been playing around with HtDig, UDMSearch, Catalog, and others
recently to find out which ones can handle large volumes of searches and
the db itself...

We are looking at indexing about 300,000 websites from top to bottom for
a specialized search engine and are concerned about performance..

If the DB could fit into RAM, we'd guess that performance may not be an
issue... but without buying 1-2 Gig of RAM at a major cost we'd rather
avoid that...

The machine for indexing will be a small PII server but the machine the
public searches on will be a PIII-500 with 128 meg of RAM...

Any benchmarks for such a size of DB etc.? Any information would be
greatly appreciated..

Paul

------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig@htdig.org containing the single word unsubscribe in
the SUBJECT of the message.




This archive was generated by hypermail 2b25 : Thu Nov 04 1999 - 11:03:07 PST