Server Pool


MSQL_User (MSQL_User@st.hhs.nl)
Wed, 24 Jun 1998 17:21:50 +0200 (MET DST)


Hello,

Last night I was thinking about htdig and the server pool. htdig 3 has
a server pool that is filled with new servers, and round-robin
scheduling is used to traverse the web tree. How should this be done
in dig 4? Maybe threads come into play here.

Here is my idea.

Some object reads the $start_url string and creates a thread for
each server. A thread is responsible for indexing that server. If a
new server is found by a thread, it is reported, and a new thread
is started. If there are too many threads around, the reported
new server is kept in the pool until a thread has completed its
current server digging (a thread that has completed its current
server doesn't have to die). If there are no new servers waiting,
the completed thread is stopped (dies). What's the gain? Well,
fast (probably local) servers respond fast and are indexed fast (the
load can be reduced by inserting a sleep() after each indexed document).
Slow servers take longer to respond and are indexed more slowly.
Hopefully Java knows what to do with threads waiting on the network.
As far as I know, dig 3 just waits for each document to be retrieved
before it continues. The overall gain is that digging is faster, because
waiting threads are blocked while runnable threads get to run.
Like it?
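
A minimal sketch of the scheme above in Java, to make the thread
lifecycle concrete. Everything here is an assumption for illustration:
ServerPool, Digger, report() and awaitIdle() are hypothetical names, not
part of htdig or dig 4, and the Digger stands in for the real fetching
and parsing code (which is also where a sleep() between documents would
go).

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch: none of these names come from htdig or dig 4.
class ServerPool {
    /** Indexes one server and returns the names of servers it discovered. */
    interface Digger {
        Iterable<String> dig(String server) throws Exception;
    }

    private final Queue<String> waiting = new ArrayDeque<>(); // servers with no thread yet
    private final Set<String> seen = new HashSet<>();         // every server ever reported
    private final int maxThreads;
    private final Digger digger;
    private int running = 0;                                   // live worker threads

    ServerPool(int maxThreads, Digger digger) {
        this.maxThreads = maxThreads;
        this.digger = digger;
    }

    /** Report a server: start a thread if there is room, else keep it in the pool. */
    synchronized void report(String server) {
        if (!seen.add(server)) return;   // already running, waiting, or done
        if (running < maxThreads) {
            running++;
            new Thread(() -> work(server)).start();
        } else {
            waiting.add(server);
        }
    }

    /** A worker keeps taking waiting servers and only dies when the pool is empty. */
    private void work(String first) {
        for (String server = first; server != null; server = next()) {
            try {
                for (String found : digger.dig(server)) report(found);
            } catch (Exception e) {
                // One failing server should not take the worker down with it.
                System.err.println("giving up on " + server + ": " + e);
            }
        }
    }

    /** Next waiting server, or null (after which the calling worker exits). */
    private synchronized String next() {
        String server = waiting.poll();
        if (server == null) {
            running--;
            notifyAll();
        }
        return server;
    }

    /** Block until every worker thread has died. */
    synchronized void awaitIdle() {
        boolean interrupted = false;
        while (running > 0) {
            try { wait(); } catch (InterruptedException e) { interrupted = true; }
        }
        if (interrupted) Thread.currentThread().interrupt();
    }
}
```

The caller would report() each server from $start_url and then call
awaitIdle(). A thread blocked on a slow server simply sits inside dig()
while the others keep running, which is where the gain would come from.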

Now the what-ifs.

What if $start_url has two entries with the same server? One is
started and the other is put in the pool marked "already on a thread".
When the thread for that server completes, the entry in the pool is
given to it. If another thread (with a different server name) becomes
free while a thread is still running on the server marked "already on
a thread", the free thread takes another name from the pool, and if
none is available, it dies.

What if a thread bumps into a server name that's being indexed on
another thread? It's put in the pool and marked "already on a thread",
just like above.
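
The "already on a thread" rule might look roughly like this in Java.
This is only a sketch under assumptions: StartUrlPool, offer() and
finished() are made-up names, and the server name is taken to be the
host part of an absolute http URL.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: these names are illustrative, not htdig or dig 4 API.
class StartUrlPool {
    private final List<URI> parked = new ArrayList<>(); // entries waiting for a thread
    private final Set<String> busy = new HashSet<>();   // servers "already on a thread"

    /**
     * Offer a URL. Returns true if the caller should start a thread for it,
     * false if its server is already on a thread and the URL was parked.
     * Assumes absolute URLs, so getHost() is never null.
     */
    synchronized boolean offer(URI url) {
        if (busy.add(url.getHost())) return true;
        parked.add(url);
        return false;
    }

    /**
     * Called by a thread that has finished digging the given server.
     * Returns the next URL for that thread, or null if the thread should die.
     */
    synchronized URI finished(String server) {
        busy.remove(server);
        URI pick = null;
        for (URI u : parked) {
            if (busy.contains(u.getHost())) continue; // still on another thread
            if (u.getHost().equals(server)) {         // prefer the same server name
                pick = u;
                break;
            }
            if (pick == null) pick = u;               // otherwise the first free entry
        }
        if (pick != null) {
            parked.remove(pick);
            busy.add(pick.getHost());
        }
        return pick;
    }
}
```

With two start URLs on a.example and one on b.example, the second
a.example entry is parked. The b.example thread gets null from
finished() while a.example is still busy, and the a.example thread
picks up the parked entry when it completes.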

How to build the databases? I have no idea. Hopefully the database engine
knows how to deal with multiple connections reading and writing on one
database.

Output on the screen (htdig -v)? Yeah, that's a nice one. I'll think
about it. It should be something like this: a docid is given to a
document (being indexed by a thread) only when the parsed info is
written to the database, not before that.
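
One way to get that behaviour, sketched in Java under assumptions:
DocumentStore and store() are invented names, and an in-memory list
stands in for the real database. The docid is handed out inside the same
synchronized step that does the write, so ids have no gaps and the
verbose line is printed in docid order no matter which thread parsed the
document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a list stands in for the real database.
class DocumentStore {
    private final List<String> urls = new ArrayList<>(); // index in the list is the docid
    private final boolean verbose;

    DocumentStore(boolean verbose) {
        this.verbose = verbose;
    }

    /** Writes one parsed document and returns the docid assigned at write time. */
    synchronized int store(String url, String parsedInfo) {
        int docId = urls.size();   // next free id, chosen only now
        urls.add(url);             // the real database write would go here
        if (verbose) System.out.println(docId + " " + url);
        return docId;
    }

    synchronized int count() {
        return urls.size();
    }
}
```

The price is that every write is serialized through one lock, which is
probably acceptable as long as fetching, not writing, dominates the
running time.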

see you,

--jesse
---------------------------------------------------------------------
J. op den Brouw Johanna Westerdijkplein 75
Haagse Hogeschool 2521 EN DEN HAAG
Sector Techniek Netherlands
Afdeling Elektrotechniek +31 70 4458936
-------------------------- msql@st.hhs.nl ---------------------------

htdig survey: http://crytonII.st.hhs.nl/htdig/survey.html


