Re: htdig: Digging through HTTP and wait between documents


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Tue, 8 Dec 1998 11:35:55 -0600 (CST)


According to Klaus Mueller:
> is it possible to set a small wait time between digging two documents from
> one server to prevent server overload?

The retriever code seems to try to spread the load among the servers it
accesses, but there doesn't seem to be anything to prevent rapid fire
against a single server, if you're indexing only one or two servers.

A quick fix would be to add a sleep() call just before c.connect()
in Document::RetrieveHTTP() (file htdig/Document.cc). That would slow
the whole dig down, whether it's accessing the same server repeatedly,
or interleaving its requests.
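
As a rough illustration of the quick fix (not the actual htdig code),
here is a minimal sketch. The Connection struct and delayed_connect()
are hypothetical stand-ins; in htdig the pause would simply go right
before the existing c.connect() call, and plain sleep() would do just
as well as the std::this_thread::sleep_for() used here.

```cpp
#include <chrono>
#include <thread>

// Hypothetical stand-in for the connection object that
// Document::RetrieveHTTP() uses; it only counts connect attempts.
struct Connection {
    int connects = 0;
    bool connect() { ++connects; return true; }
};

// Pause for a fixed delay, then connect. Every request pays the
// delay, no matter which server it is going to.
bool delayed_connect(Connection& c, std::chrono::milliseconds delay) {
    std::this_thread::sleep_for(delay);
    return c.connect();
}
```

Because the delay is unconditional, a dig that already alternates
between many servers gets slowed down just as much as one that hammers
a single host.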

A proper fix would keep track of when each host was last accessed.
Before any access to a host, if the last access was more recent than
the number of seconds given by some new config parameter, it would
sleep for the remaining time. By recording the time at each c.close(),
and checking it before c.connect(), it would ensure a minimum idle
time between connections to the same host.
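
The bookkeeping for this could look something like the sketch below.
The HostThrottle class and its method names are my own invention for
illustration, not anything in the htdig source, and the minimum idle
time would come from the new config parameter, which doesn't exist
yet. Times are passed in explicitly so the logic is easy to check.

```cpp
#include <chrono>
#include <map>
#include <string>

using Clock = std::chrono::steady_clock;
using Millis = std::chrono::milliseconds;

// Hypothetical helper: remembers when each host was last released.
class HostThrottle {
public:
    explicit HostThrottle(Millis min_idle) : min_idle_(min_idle) {}

    // Call right after c.close() for this host.
    void record_close(const std::string& host, Clock::time_point now) {
        last_close_[host] = now;
    }

    // Call right before c.connect(): returns how long to sleep so
    // that at least min_idle_ has passed since the last close.
    Millis wait_before_connect(const std::string& host,
                               Clock::time_point now) const {
        auto it = last_close_.find(host);
        if (it == last_close_.end())
            return Millis(0);            // never seen this host
        Millis idle = std::chrono::duration_cast<Millis>(now - it->second);
        if (idle >= min_idle_)
            return Millis(0);            // already idle long enough
        return min_idle_ - idle;         // sleep for the difference
    }

private:
    Millis min_idle_;
    std::map<std::string, Clock::time_point> last_close_;
};
```

The caller would sleep for whatever wait_before_connect() returns, so
a dig spread across many servers rarely sleeps at all, while repeated
hits on one server get spaced out.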

If you want to get really fancy, you could make the amount of delay
dependent on the URL you're accessing. E.g. you may want a bigger
delay for something with .cgi or /cgi-bin/, if you're indexing these,
than you'd use for .html files.
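
One simple way to pick the delay from the URL is sketched below. The
function name and both delay parameters are hypothetical; the two
values would presumably come from new config attributes. It only
looks at the path, so a query string after a .cgi script doesn't hide
the extension.

```cpp
#include <string>

// Hypothetical helper: choose a delay (in seconds) based on whether
// the URL looks like a CGI script or an ordinary page.
int delay_seconds_for_url(const std::string& url,
                          int page_delay, int cgi_delay) {
    // Drop any query string or fragment before examining the path.
    const std::string path = url.substr(0, url.find_first_of("?#"));

    const std::string ext = ".cgi";
    const bool ends_with_cgi =
        path.size() >= ext.size() &&
        path.compare(path.size() - ext.size(), ext.size(), ext) == 0;
    const bool in_cgi_bin = path.find("/cgi-bin/") != std::string::npos;

    return (ends_with_cgi || in_cgi_bin) ? cgi_delay : page_delay;
}
```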

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


