Subject: Re: [htdig] One solution for slow dig on Linux.
From: Torsten Neuer (tneuer@inwise.de)
Date: Tue Dec 21 1999 - 10:17:25 PST


Sean Pecor wrote:
>
> Hello Torsten,
>
> > As of 3.1.2 there was already a patch solution for this which has been
> > incorporated into 3.1.4 and which is much cleaner than just renaming
> > REQUEST_METHOD. In other words, you applied a patch for something the
> > search engine is already able to do ;-)
>
> Grin! How do I utilize this feature? When I passed the query_string as an
> argument to htsearch, it ignored it and instead detected the REQUEST_METHOD
> environment variable in place for the encapsulating CGI and grabbed the
> actual QUERY_STRING variable.

Htsearch first checks for a valid QUERY_STRING, so you will have to unset this
when calling it from a CGI (there is no such problem with "normal" wrappers
that use server scripting instead). Then call htsearch with the complete
QUERY_STRING (i.e. just like it would look when set up by an HTML <form>)
on the command line. Make sure to shell-escape all parameters that can be
modified by the HTML <form>, or else users might be able to execute arbitrary
code on your machine.
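
For illustration, a minimal sketch of such a wrapper in Python (the htsearch
path, the config name and the "words" parameter are assumptions for the
example, not htdig defaults). Passing the query string as a single argv entry,
with no shell in between, side-steps the escaping problem because user input
never reaches a shell:

    #!/usr/bin/env python
    # Sketch of a CGI wrapper that invokes htsearch directly.
    # Assumptions: htsearch lives at /usr/local/bin/htsearch and accepts
    # the query string as a command-line argument, as described above.
    import os
    import subprocess
    import urllib.parse

    def run_htsearch(words, config="mysite"):
        # Build the query string exactly as an HTML <form> would submit it.
        query = urllib.parse.urlencode({"config": config, "words": words})

        # htsearch looks at QUERY_STRING/REQUEST_METHOD first, so drop the
        # wrapper's own CGI environment before calling it.
        env = dict(os.environ)
        env.pop("QUERY_STRING", None)
        env.pop("REQUEST_METHOD", None)

        # Argument list, no shell: nothing to escape, nothing to inject.
        result = subprocess.run(
            ["/usr/local/bin/htsearch", query],
            env=env, capture_output=True, text=True, check=True,
        )
        return result.stdout

    if __name__ == "__main__":
        print(run_htsearch("slow dig linux"))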

 
> sensitive to bandwidth issues and want to keep access to a minimum (I've
> even removed the robots.txt retrieval logic because I'm not actually using
> htdig to spider pages).

Removing a standard means of restricting indexing is always a bad idea. That
you don't crawl through the sites doesn't mean that you're "allowed" to index
the start document of a site (maybe it serves highly dynamic content and is
thus not suitable for indexing).

Robots.txt processing does not take up so much bandwidth that you would really
gain much by removing it. On the other hand, you might find webmasters getting
angry at you for not paying attention to their robots control files!

> > If you have plenty of disk space, I'd even have a single small database
> > for every site being indexed (and have them merged after the index run),
> > in which case you can run multiple instances of the indexer concurrently
> > (you can then have a merger process waiting for new input to be merged
> > into the new search database). That should further increase the speed
> > of the indexer process.
>
> Another good approach. However, unless I am misunderstanding your
> suggestion, Linux hates having thousands of files in a single directory and
> its performance is severely penalized during file i/o in this case. Not a
> serious problem, but you'd have to build more management logic to spread the
> files throughout many directories (i.e., /dbs/o/on/onesite.db,
> /dbs/t/tw/twosite.db, etc.). I think I'll just wait until htdig is
> multi-threaded ;).

As you said, there is no need to put thousands of files in a single directory.
You can have multiple configurations, one for each indexer process. You can
therefore also have one db directory for each indexer process. The management
logic that splits up the URL list and feeds it to the different indexer
processes can be made fairly simple (you can set up a database for that one,
too).
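
A rough sketch of that management logic in Python, assuming one hypothetical
config file per site under /opt/htdig/conf/ (each pointing its database_dir at
its own directory) and that htdig and htmerge take a -c <config> option; the
use of htmerge -m to fold another configuration's database into the main one
is my reading of the 3.1.x docs, so check it against your version:

    #!/usr/bin/env python
    # Sketch: run several htdig indexers concurrently, one config and db
    # directory per site, then merge the results into the main database.
    # Paths, config names and worker count are illustrative assumptions.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    SITES = ["onesite", "twosite", "threesite"]   # one config per site
    MAX_PARALLEL = 4

    def index_site(site):
        # Each per-site config keeps its databases in its own directory,
        # so concurrent indexers never touch the same files.
        subprocess.run(["htdig", "-c", f"/opt/htdig/conf/{site}.conf"],
                       check=True)
        return site

    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        for site in pool.map(index_site, SITES):
            # Merge each finished per-site database into the main search
            # database (-m names the other configuration to merge from;
            # verify against your htmerge man page).
            subprocess.run(
                ["htmerge", "-c", "/opt/htdig/conf/main.conf",
                 "-m", f"/opt/htdig/conf/{site}.conf"],
                check=True,
            )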

Waiting for the indexer to become multi-threaded is of course a much easier
option ;-)

cheers,

  Torsten

-- 
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14                            Tel: +49-4101-403605
D-25474 Ellerbek                            Fax: +49-4101-403606
E-Mail: info@inwise.de            Internet: http://www.inwise.de

------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig-unsubscribe@htdig.org
You will receive a message to confirm this.



This archive was generated by hypermail 2b28 : Tue Dec 21 1999 - 10:32:05 PST