Re: [htdig] Indexing large amount of non-related files


Subject: Re: [htdig] Indexing large amount of non-related files
From: Marcel Hicking (m.hicking@via-net-works.de)
Date: Wed May 24 2000 - 03:54:05 PDT


Since I don't have a document referring to all the
files to be indexed, I'm thinking of generating a
start_url file "on the fly".

I have been doing this for a much smaller site:
I have set up a little shell script that generates
a list of all available files and pipes it through
sed to convert the local paths to http://... URLs.
ht://dig is set up with start_url=allfiles.list
and a local_urls line to "undo" that mapping
again.
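
A minimal sketch of the list-generation step, written here in Python
rather than shell and sed. The function names, the document root and the
URL prefix are illustrative assumptions, not taken from the setup
described above; the idea is only to walk the tree and rewrite each local
path as the URL that a local_urls mapping would later translate back.

```python
import os
from urllib.parse import quote


def build_url_list(doc_root, url_prefix):
    """Return a sorted list of URLs, one per regular file under doc_root.

    doc_root and url_prefix are hypothetical values, e.g.
    "/var/www/docs" and "http://www.example.com/docs/".
    """
    prefix = url_prefix.rstrip("/") + "/"
    urls = []
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        for name in filenames:
            # Path relative to the document root, with forward slashes.
            rel = os.path.relpath(os.path.join(dirpath, name), doc_root)
            rel = rel.replace(os.sep, "/")
            # Percent-encode characters that are not safe in a URL path.
            urls.append(prefix + quote(rel))
    return sorted(urls)


def write_url_list(urls, list_path):
    """Write one URL per line to list_path (e.g. allfiles.list)."""
    with open(list_path, "w") as fh:
        for url in urls:
            fh.write(url + "\n")
```

Run nightly before the dig, this would regenerate allfiles.list from
whatever is on disk at that moment, so new files are picked up without
any page having to link to them.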

Do you think this is appropriate for a larger search
or do you have any other suggestions?

Marcel

On 23 May 00, at 17:09, Geoff Hutchison wrote:

> At 6:51 PM +0200 5/23/00, Marcel Hicking wrote:
> >I have about 200,000 plain text files
> >spread over a few hundred, maybe 1,000, directories.
> >File size is between a few bytes and, sometimes,
> >above 1 MB. All in all this adds up to 1.2 GB
> >of data, growing daily. The files do not
> >contain HTML code and I need them to be
> >indexed at least daily (that is, nightly ;-)
> >Most of the files are static; only a few of them
> >change, say, 100-200 a day.
>
> Well I don't think you'll have much problem indexing them with
> ht://Dig. As to performance, it depends a lot on your machine and the
> data itself. It sounds like you might get some use out of local_urls,
> though if they don't have extensions, you might see it hit the HTTP
> server a lot as it tries to figure out the MIME type.
>
> Also remember that ht://Dig currently doesn't have any sort of "index
> this directory" feature.
>
> --
> -Geoff Hutchison
> Williams Students Online
> http://wso.williams.edu/



