Subject: Re: [htdig3-dev] Creating a SQL backend...
From: Bill Carlson (firstname.lastname@example.org)
Date: Wed Jun 28 2000 - 08:09:17 PDT
On Wed, 28 Jun 2000, tomi wrote:
> Dear Torsten,
> I accept all the replies (I would be crazy not to.. :)). But I would
> like to remind you again of the possibilities of an SQL-based database. It
> makes it possible, in fact, building on the engine of databases such as
> Oracle or Postgres (as said previously), to create an RDBMS that stores all
> the data in a single clustered database, something that, as you can easily
> imagine, would give a university, and even more a large company site, the
The purpose of a database is to organize information. For the purposes of
ht://Dig, the current format works very well. It can easily be "clustered"
by various methods at other levels of the OS. SQL at this point does not
lend itself to clustering other than through high-end, expensive solutions,
at least not in a true cluster sense.
What is really needed (and has been discussed before) are tools to "dump
and reload" the ht://Dig databases, so that the indexes can be easily
modified in ways other than index/merge. Then you get the best of both
worlds: a fast search engine and a data repository for things like
what's-new scripts and things of that nature.
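To make the "dump and reload" idea concrete, here is a minimal sketch of what such a tool could look like for a generic key/value database. It is only an illustration: it uses Python's standard `dbm.dumb` module as a stand-in for Berkeley DB, the function names `dump_db` and `load_db` are hypothetical, and the tab-separated base64 text format is an assumption, not anything ht://Dig defines.

```python
import base64
import dbm.dumb


def dump_db(db_path, text_path):
    """Write every key/value pair of a dbm database to a text file.

    Returns the number of records written.
    """
    count = 0
    with dbm.dumb.open(db_path, "r") as db, \
            open(text_path, "w", encoding="ascii") as out:
        for key in sorted(db.keys()):
            # Base64 keeps arbitrary binary keys and values on one line each.
            k = base64.b64encode(key).decode("ascii")
            v = base64.b64encode(db[key]).decode("ascii")
            out.write(f"{k}\t{v}\n")
            count += 1
    return count


def load_db(text_path, db_path):
    """Rebuild a dbm database from a file written by dump_db.

    Returns the number of records loaded.
    """
    count = 0
    # Flag "n" always starts from a new, empty database.
    with open(text_path, encoding="ascii") as src, \
            dbm.dumb.open(db_path, "n") as db:
        for line in src:
            k, v = line.rstrip("\n").split("\t")
            db[base64.b64decode(k)] = base64.b64decode(v)
            count += 1
    return count
```

Between the dump and the reload, the text file can be filtered or edited with ordinary tools, which is the kind of modification "other than index/merge" described above.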
> capability to create a complete index of all their documents.
> Last month I launched my crawler at my university to count the
> documents present and produce statistics. There were about 20,000,000 HTML
> and hypertext documents across all the servers (which makes me assume
> there are many more, because a great deal of the unreached ones were in .ps,
> .pdf, or other formats that cannot be followed).
> I did not test ht://Dig indexing this great amount of data, but
> everything leads me to think that BerkeleyDB is not the appropriate way to do it.
I think you are incorrect; if anything, the current databases would be
best, as they are fast and compress as much information as possible.
Granted, if one has Oracle lying around and enough hardware to support
it, that might seem like a better solution, but I'd still bet on the
> Another way could be to parallelize the storing and search routines, to put,
> for example, 50 BerkeleyDBs on 50 different machines, clustered as a Beowulf
> system, that would work via rsh for the search method...
You don't even have to get that fancy. Take 50 $1500 x86 boxes with enough
shared storage and create a web farm out of them. I'll bet your databases
would fit on each machine (are you really talking 20 million documents
or is that a tongue-in-cheek phrase?) and you won't be crawling more than
once a week anyway. And 50 machines would serve quite a few requests.
At our site, we get around 8 million hits a month, with around a quarter
million searches. Our databases have 13,000 documents and are about 200MB.
We serve that off 2 older Sun E450s with 384 MB of RAM; they don't even
notice the load that places on them. I'm guessing your 20,000,000
documents would have anywhere from 80GB to 110GB databases. Throw them on
a couple of x86s with 1GB of RAM and RAID0 over 4 40GB IDE drives (see
www.3ware.com for cards that will do that!) and you could serve several
hundred thousand searches a month for a cost under $10,000. The equivalent
SQL database would be immense.
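The figures above can be checked with some quick arithmetic. The sketch below only restates numbers from this message; the decimal units (1 MB = 10^6 bytes) and the 30-day month are assumptions, and the function names are made up for illustration.

```python
# Figures quoted in the message.
SITE_DOCS = 13_000
SITE_DB_BYTES = 200 * 10**6          # "about 200MB", decimal units assumed
SITE_SEARCHES_PER_MONTH = 250_000    # "around a quarter million searches"

TARGET_DOCS = 20_000_000
GUESS_LOW_BYTES = 80 * 10**9         # low end of the 80GB to 110GB guess
GUESS_HIGH_BYTES = 110 * 10**9       # high end of the guess

RAID_DRIVES = 4
DRIVE_BYTES = 40 * 10**9             # 40GB IDE drives

SECONDS_PER_MONTH = 30 * 24 * 3600   # 30-day month assumed


def bytes_per_doc(db_bytes, docs):
    """Average database bytes consumed per indexed document."""
    return db_bytes / docs


def raid0_capacity(drives, drive_bytes):
    """RAID0 stripes without redundancy, so drive capacities simply add."""
    return drives * drive_bytes


def searches_per_second(per_month):
    """Average search rate implied by a monthly total."""
    return per_month / SECONDS_PER_MONTH


if __name__ == "__main__":
    print("current site, bytes/doc:", bytes_per_doc(SITE_DB_BYTES, SITE_DOCS))
    print("guess low, bytes/doc:   ", bytes_per_doc(GUESS_LOW_BYTES, TARGET_DOCS))
    print("guess high, bytes/doc:  ", bytes_per_doc(GUESS_HIGH_BYTES, TARGET_DOCS))
    print("RAID0 capacity, bytes:  ", raid0_capacity(RAID_DRIVES, DRIVE_BYTES))
    print("searches/second:        ", searches_per_second(SITE_SEARCHES_PER_MONTH))
```

Under these assumptions the 4-drive RAID0 set holds 160GB, which covers the 80GB to 110GB guess, and a quarter million searches a month averages under 0.1 searches per second. The guess works out to 4,000 to 5,500 bytes per document, against roughly 15,385 at the 13,000-document site, so it assumes the databases grow less than linearly with document count.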
> The problem that arises here is: htdig creates different databases that
> htmerge merges, eliminating the identical documents... this is a good
> process, but as said it makes parallelizing unrealizable...
It seems like you don't follow the flow of ht://Dig. The index/merge
creates the databases against which searches are performed. One can be
indexing to create one set of databases while another set continues to
serve searches. The time to index/merge is limited more by network
bandwidth than by horsepower on the machine. In the search farm setup, one
machine would do the index/merge and, when complete, push the updated
databases out to the rest of the farm.
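One way to picture the "build one set while the other serves searches, then push" flow is a copy followed by an atomic switch. The sketch below is an assumption-laden illustration, not how ht://Dig itself does it: the names `publish` and `push_generation`, the `db-N` directory layout and the `current` symlink are all hypothetical, a local directory stands in for a farm machine, and the atomic rename relies on POSIX behaviour.

```python
import os
import shutil


def publish(new_db_dir, live_link):
    """Point live_link at new_db_dir with a single atomic rename.

    Searches that resolve live_link see either the old or the new
    database set, never a half-copied one.
    """
    tmp_link = live_link + ".new"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an interrupted run
    os.symlink(os.path.abspath(new_db_dir), tmp_link)
    # On POSIX, rename replaces the existing symlink itself, atomically.
    os.replace(tmp_link, live_link)


def push_generation(built_dir, node_root, generation):
    """Copy a freshly built database set to a node, then switch it live.

    node_root is a local directory standing in for one machine of the
    farm; a real setup would copy over the network instead.
    """
    target = os.path.join(node_root, f"db-{generation}")
    shutil.copytree(built_dir, target)
    publish(target, os.path.join(node_root, "current"))
    return target
```

The search side would always open its databases through `current`, so the indexing machine can take as long as it needs on the next generation without interrupting searches.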
SQL is nice, but it is not the be-all and end-all.
Systems Programmer email@example.com | Opinions are mine,
Virtual Hospital http://www.vh.org/ | not my employer's.
University of Iowa Hospitals and Clinics |