Subject: Re: [htdig] One solution for slow dig on Linux.
From: Torsten Neuer
Date: Tue Dec 21 1999 - 10:17:25 PST

Sean Pecor wrote:
> Hello Torsten,
> > As of 3.1.2 there was already a patch solution for this which has been
> > incorporated into 3.1.4 and which is much cleaner than just renaming
> > REQUEST_METHOD. In other words, you applied a patch for something the
> > search engine is already able to do ;-)
> Grin! How do I utilize this feature? When I passed the query_string as an
> argument to htsearch, it ignored it and instead detected the REQUEST_METHOD
> environment variable in place for the encapsulating CGI and grabbed the
> actual QUERY_STRING variable.

Htsearch first checks for a valid QUERY_STRING, so you will have to
unset this when calling it from a CGI (there is no such problem with
"normal" pages that use server scripting instead). Then call htsearch
with the QUERY_STRING (i.e. just like it would look when being set up
by a <form>) on the command line (make sure to shell-escape all
parameters that can be modified by the HTML <form> or else users might
be able to execute arbitrary code on your machine).
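
If it helps, here is a rough Python sketch of such a wrapper CGI. The
htsearch path and the "config"/"words" parameters are assumptions about
a typical installation, so adjust them to your setup; passing the query
as a single exec argument (no shell at all) is the safe equivalent of
escaping it:

    #!/usr/bin/env python
    # Rough sketch of a wrapper CGI that hands the query to htsearch on
    # the command line.  The HTSEARCH path and config name are assumed.
    import os
    import subprocess
    from urllib.parse import urlencode

    HTSEARCH = "/usr/local/bin/htsearch"   # adjust for your installation

    def run_htsearch(words, config="mysite"):
        env = dict(os.environ)
        # htsearch looks at REQUEST_METHOD/QUERY_STRING first, so drop
        # them to make it read the query from the command line instead.
        env.pop("REQUEST_METHOD", None)
        env.pop("QUERY_STRING", None)
        # Build the query string exactly as a <form> would encode it.
        query = urlencode({"config": config, "words": words})
        # One argv element and no shell involved, so user input cannot
        # be used to run arbitrary commands.
        out = subprocess.run([HTSEARCH, query], env=env,
                             capture_output=True, text=True)
        return out.stdout

    if __name__ == "__main__":
        print(run_htsearch("slow dig linux"))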

> sensitive to bandwidth issues and want to keep access to a minimum (I've
> even removed the robots.txt retrieval logic because I'm not actually using
> htdig to spider pages).

Removing a standard means of restricting indexing is always a bad idea.
Just because you don't crawl through the sites doesn't mean that you're
"allowed" to index the start document of a site (maybe it serves highly
dynamic content and is thus not suitable for indexing).

Robots.txt processing does not take up so much bandwidth that you would
gain much by removing it. On the other hand, you might find webmasters
angry at you because you do not pay attention to their robots control
files!

> > If you have plenty of disk space, I'd even have a single small database
> > for every site being indexed (and have them merged after the index run),
> > in which case you can run multiple instances of the indexer concurrently
> > (you can then have a merger process waiting for new input to be merged
> > into the new search database). That should further increase the speed
> > of the in-dexer process.
> Another good approach. However, unless I am misunderstanding your
> suggestion, Linux hates having thousands of files in a single directory and
> its performance is severely penalized during file i/o in this case. Not a
> serious problem, but you'd have to build more management logic to spread the
> files throughout many directories (i.e., /dbs/o/on/onesite.db,
> /dbs/t/tw/twosite.db, etc.). I think I'll just wait until htdig is
> multi-threaded ;).

As you said, there is no need to put thousands of files in a single
directory. You can have multiple configurations, one for each indexer
process. You can therefore also have one db directory for each indexer
process. The logic that splits up the URL list and feeds it to the
different indexer processes can be made fairly simple (you can set up a
database for that one, too).
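
As a rough illustration (the per-site config names, the paths, and the
"htmerge -m" merge call below are assumptions about the local setup and
the ht://Dig version in use), running one indexer per site and merging
afterwards could look like this:

    #!/usr/bin/env python
    # Sketch: run one htdig per site, each with its own config file and
    # database directory, then fold the results into the main database.
    # Config paths and the "htmerge -m" call are assumed, not verified.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    SITES = ["onesite", "twosite", "threesite"]   # hypothetical sites
    MAIN_CONF = "/opt/htdig/conf/main.conf"       # hypothetical main config

    def index_site(site):
        conf = "/opt/htdig/conf/%s.conf" % site   # per-site config/db dir
        subprocess.run(["htdig", "-i", "-c", conf], check=True)
        return conf

    with ThreadPoolExecutor(max_workers=4) as pool:
        for conf in pool.map(index_site, SITES):
            # Merge each finished per-site database into the main one.
            subprocess.run(["htmerge", "-c", MAIN_CONF, "-m", conf],
                           check=True)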

Waiting for the indexer to become multi-threaded is of course a much
easier option ;-)



InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14                            Tel: +49-4101-403605
D-25474 Ellerbek                            Fax: +49-4101-403606

