Subject: [htdig] One solution for slow dig on Linux.
From: Sean Pecor (firstname.lastname@example.org)
Date: Mon Dec 20 1999 - 21:21:58 PST
I've recently had some success tackling some problems that I had created for
myself, and my guess is that one or two of you may benefit from my suffering
<grin>. I had a fairly large development project that required a search
engine capable of handling approximately 150,000 internal pages and 10,000+
pages on thousands of external web servers. Despite being warned by the FAQ
that htdig wasn't built for this task, I couldn't resist giving it a go,
especially since the source code for the entire engine was in C++ and freely
available for me to modify.
My first problem was that I was encapsulating the output of htsearch within
my own CGI engine, a group collaboration tool written entirely in C that was
driving this particular online community. So I had to comment out the output
of the "Content-type: text/html" header in the relevant locations within the
htsearch source (Display.cc?). Whew, that was easy.
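For anyone doing the same thing, the flip side of that change is that the
wrapper now owns the HTTP header. Mine boils down to something like this (the
function name is just for illustration, not my actual engine code):

  #include <stdio.h>

  /* Sent exactly once, before any htsearch output is embedded in the page;
     htsearch's own Content-type line is the bit I commented out. */
  static void emit_header(void)
  {
      printf("Content-type: text/html\r\n\r\n");
  }
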
My second problem was related to the first. Since my CGI engine needed its
own query string to work its magic, and indirectly controlled htsearch, I
had to modify the htsearch source so that it didn't detect the presence of
the query string (I renamed REQUEST_METHOD in htlib/cgi.cc). In this manner,
I could then pass the goods directly as arguments to an external call to
htsearch during the execution of my CGI (e.g. htsearch -c /my/custom.conf
"page=2&words=woah&cmd=command&searchtype=mysearch"). I then had to modify
the portion of Display.cc that built the hrefs to the next, previous and
page number links so that my own special query string name/value pairs were
piggy-backed. Whew, that was pretty easy too. After whipping up my own set
of HTML templates for htsearch (simple ones really, since the interface
framework was actually being provided by my group collaboration engine) I
was ready to start the real fun stuff.
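In case it helps anyone wiring htsearch into their own CGI the same way,
here's a stripped-down sketch of what the external call amounts to. The
binary path, config file and query string are placeholders, and real code
should sanitize the query before building the command line:

  #include <stdio.h>

  /* Because the REQUEST_METHOD check was renamed in htlib/cgi.cc, htsearch
     takes the query from argv even though it's being run from inside another
     CGI. Its template output is streamed straight into the page being built. */
  static int embed_htsearch(const char *query)
  {
      char cmd[1024];
      FILE *fp;
      int c;

      snprintf(cmd, sizeof(cmd),
               "/usr/local/bin/htsearch -c /my/custom.conf \"%s\"", query);

      fp = popen(cmd, "r");
      if (fp == NULL)
          return -1;

      while ((c = fgetc(fp)) != EOF)
          putchar(c);

      return pclose(fp);
  }

  /* e.g. embed_htsearch("page=2&words=woah&cmd=command&searchtype=mysearch"); */
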
My third problem was figuring out how to hand 150,000 unique URLs to htdig
without having it spider the web site to find them. I already knew the URLs
I wanted it to index and, in fact, I didn't want it to go any further than
the URLs I specified. Luckily it lets you hand it a file of proper URLs to
get it going, so I wrote a program to create the list file for me. I then
specified a maximum hop count of zero in the htdig configuration file. Lest
some of you disbelievers think that htdig can't handle the big stuff, I can
testify that it did just fine when I force-fed it a fourteen-megabyte text
file containing no fewer than one hundred forty-nine thousand, nine hundred
seventy-four unique URLs! It swallowed them up in well under four hours on a
puny little 256 MB PII-400 running Red Hat Linux 5.0 and Apache, using an
insignificant amount of CPU time.
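For reference, the relevant bits of my htdig configuration look more or less
like this (the backquotes tell htdig to pull the URL list out of a file; the
path is a placeholder and I'm quoting the attribute names from memory):

  # Feed htdig the prebuilt list of URLs and don't let it wander any further.
  start_url:      `/home/httpd/htdig/url_list.txt`
  max_hop_count:  0
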
My fourth problem was the 10,000+ pages on 2,000+ external web servers.
Again, I only wanted the pages I specified, so I built a program to create
the list and then force-fed it to htdig. This wasn't the problem. The
problem was that htdig would go to sleep during the indexing process and,
seemingly, never wake up. I ran it in debug mode and saw that it would
eventually hit a web page on a new server and stall. And stall. And stall.
After about twenty or thirty minutes htdig would finally time out and
continue until it eventually hit another web page on a dead server and stall
again. When you're dealing with 2,000 web servers there are bound to be
dozens of dead machines (likely in direct correlation with the number of NT
servers. Heh heh). I tried without luck to find an explanation of why the
"timeout: 20" (seconds) in my htdig configuration file was being translated
as 30 minutes. I spent an entire day researching on the Net to uncover
possible causes. The author of htdig indicated in an earlier mailing list
post that he couldn't recreate the problem, and wasn't sure why the timeout
setting wasn't working for some people. This was NOT encouraging, but I'm
stubborn, so I kept hammering away. During a stall, I found that netstat
showed the htdig process holding a socket connection stuck in the SYN_SENT
state. I went searching for info on that (I'm not an IP guru) and found some
Linux kernel tweaking notes. I peeked at the value contained within my
/proc/sys/net/ipv4/tcp_syn_retries file and found "10". I peeked at the
value in my /proc/sys/net/ipv4/tcp_fin_timeout file and found "180" seconds.
Using my superior math skills (heh) I determined that 10 retries at 180
seconds each is 30 minutes, which was pretty close to how long each htdig
stall was. So I crossed my fingers and changed the timeout to 30 seconds and
the number of retries to 2. Voila! The htdig index process still stalled,
but each stall lasted only a minute or so, and the entire index got built.
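For anyone who wants to try the same tweak, this is roughly what it boils
down to (note that values written to /proc don't survive a reboot unless you
re-apply them from a boot script):

  # See where the stalled connection is sitting:
  netstat -tn | grep SYN_SENT

  # Kernel defaults on my box:
  cat /proc/sys/net/ipv4/tcp_syn_retries    # was 10
  cat /proc/sys/net/ipv4/tcp_fin_timeout    # was 180

  # What I changed them to:
  echo 2  > /proc/sys/net/ipv4/tcp_syn_retries
  echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
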
I'd be interested in comments from any IP / Linux gurus regarding my
tcp_fin_timeout / tcp_syn_retries tweaking. Are 30 seconds and 2 retries
too limiting or dangerous for a production machine?
All the best,
# Digital Spinner, Inc.
# Web Design, Development and Consulting.
# Phone: 802.948.2020
# Fax: 802.948.2749