Re: [htdig] suggestions for large multi-server indexing?


From: Geoff Hutchison (ghutchis@wso.williams.edu)
Date: Fri Jun 09 2000 - 13:10:31 PDT


On Fri, 9 Jun 2000, Albert Lunde wrote:

> I'd like to hear your suggestions for doing large-scale multi-server
> indexing with htdig.

Heck, if you want to discuss this in person, I'm a few blocks away. :-)

> (1) What are the pros and cons of doing a single big index
> (giving it starting URLs across all servers) vs. doing a number of
> small indexes and merging them?

The biggest pro of doing small indexes is that the digging can occur in
parallel. When I wrote the merging code, I called it "Poor-Man's
Multithreading."

The biggest con of doing small indexes is that you'll have multiple copies
of your data, plus some space overhead for having distinct files.

My personal feeling is that if you have some bandwidth and a few spare
computers with the same configuration, build the initial index in small
indexes and then merge them together. Since the number of changes is
normally very small, I then just update the big one and delete the small
ones. See below for some additional comments.
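
To make the split concrete, here is a minimal sketch of dealing start URLs
out to a few digging machines, one small index per group. The host names and
the group count are made up for illustration, and nothing here is htdig
code; each resulting group would simply become the start URL list of one
small index's configuration.

```python
# Hypothetical sketch: divide start URLs among a few digging machines.
# Host names and the number of groups are illustrative assumptions.

def partition_round_robin(start_urls, num_groups):
    """Deal start URLs into num_groups lists, one list per small index."""
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")
    groups = [[] for _ in range(num_groups)]
    for i, url in enumerate(start_urls):
        groups[i % num_groups].append(url)
    return groups


if __name__ == "__main__":
    start_urls = [
        "http://www.example.edu/",
        "http://library.example.edu/",
        "http://math.example.edu/",
        "http://physics.example.edu/",
        "http://registrar.example.edu/",
    ]
    for n, group in enumerate(partition_round_robin(start_urls, 2)):
        # Each group is the start URL list for one small index.
        print(n, " ".join(group))
```

Round-robin is only the simplest choice. If a few servers are known to be
much larger than the rest, it probably makes more sense to spread those
across the groups by hand so no single machine gets all the big ones.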

> (2) What are issues likely to cause problems in scaling up?

The usual sorts of things: RAM, disk, time. ;-) More seriously, these do
not scale linearly. Time is probably something like O(n log n) right now,
and disk is probably somewhere in between linear and that. If you're using
3.1.x, I don't know how to estimate RAM consumption--the htmerge sorting
phase seems to be problematic for many people with truly huge databases.
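
As a back-of-the-envelope illustration of what O(n log n) means here, the
sketch below compares the estimated cost at a target index size against a
baseline size. It assumes the cost really is proportional to n log n, which
is only a guess, and the function name is made up for this example. The
95,000-URL baseline is the wso.williams.edu figure from this message; the
targets are illustrative.

```python
import math


def relative_dig_time(n_urls, baseline_urls):
    """Ratio of n*log(n) at n_urls to n*log(n) at baseline_urls.

    Assumes cost is proportional to n log n; this is an estimate,
    not a measurement of htdig itself.
    """
    if n_urls < 2 or baseline_urls < 2:
        raise ValueError("URL counts must be at least 2")
    return (n_urls * math.log(n_urls)) / (baseline_urls * math.log(baseline_urls))


if __name__ == "__main__":
    baseline = 95000
    for target in (200000, 1000000):
        ratio = relative_dig_time(target, baseline)
        print(f"{target} URLs: about {ratio:.2f}x the time of {baseline} URLs")
```

The point is just that the ratio comes out somewhat worse than the plain
ratio of URL counts, and the gap widens as the index grows.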

> (3) How large are some indexes that people have created successfully,
> and what hardware/time does it take to do it?

I know of a few people with indexes over the 1 million URL mark. The
wso.williams.edu search server has 128MB RAM and hosts something like
95,000 URLs.

> The case I'm interested in is creating a campus-wide index of the
> semi-official servers at our university.
>
> No one knows exactly how much is out there to index, but rough
> guesses suggest 200-300 servers, with something like 100,000 -
> 200,000 HTML pages.

A large part of indexing time for multiserver indexing is network latency.
I would suggest starting with just a few that you expect will be large
(e.g. www.nwu.edu, which I believe already has a search engine) and slowly
start to merge in some of the rest.
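
To get a feel for how much latency matters, here is a rough sketch of the
wall-clock time spent purely on fetching. The per-page fetch time of 0.5
seconds is an assumed figure for illustration, not a measurement, and the
function name is hypothetical. The page counts are the ones guessed at
above.

```python
def fetch_hours(num_pages, seconds_per_fetch, parallel_diggers=1):
    """Hours spent fetching if each digger retrieves pages one at a time."""
    if parallel_diggers < 1:
        raise ValueError("parallel_diggers must be at least 1")
    total_seconds = num_pages * seconds_per_fetch / parallel_diggers
    return total_seconds / 3600.0


if __name__ == "__main__":
    assumed_seconds_per_fetch = 0.5  # illustrative assumption only
    for pages in (100000, 200000):
        for diggers in (1, 4):
            hours = fetch_hours(pages, assumed_seconds_per_fetch, diggers)
            print(f"{pages} pages, {diggers} digger(s): {hours:.1f} hours")
```

This ignores parsing and database time entirely, so treat it as a lower
bound on the network part only.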

> (I've been following this list for a bit, but haven't been able to
> get far with experiments on my own due to difficulties building the
> software on HP-UX 10.20 and lack of time.)

HP-UX is only semi-supported. People periodically send me patches, but I
think things break because none of the active developers has direct access
to an HP-UX machine. It also varies considerably by compiler.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
