Re: [htdig] Indexing large amount of non-related files


Subject: Re: [htdig] Indexing large amount of non-related files
From: Marcel Hicking (m.hicking@via-net-works.de)
Date: Fri May 26 2000 - 04:48:48 PDT


Ah, well, sure, too easy to hide it ;-)

find /www/htdocs/ -type f -name '*.htm*' \
  | sed 's|^/www/htdocs|http://www.yourdomain.com|' \
  > /where/ever/you/need/it/allfiles.list

This limits the list to *.htm* files (-type f also skips
directories named "foo.html"), so you don't end up with
tons of image files in the file list.
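
For anyone who wants this as a reusable script, here is a rough sketch
of the same idea as a shell function. The function name and its two
arguments are made up for illustration. It assumes file names contain
no newlines and need no URL escaping (no spaces and the like).

```shell
#!/bin/sh
# Sketch: print one URL per *.htm* file found under a document root.
# make_url_list and its arguments are illustrative names, not part of ht://Dig.
make_url_list() {
    docroot=${1%/}    # drop one trailing slash so the prefix strips cleanly
    baseurl=${2%/}
    # -type f skips directories; the quoted pattern is passed to find unexpanded.
    find "$docroot" -type f -name '*.htm*' |
        while IFS= read -r path; do
            # Swap the local prefix for the URL prefix without any regex.
            printf '%s%s\n' "$baseurl" "${path#"$docroot"}"
        done |
        sort
}

# Example call (paths as in the command above):
# make_url_list /www/htdocs/ http://www.yourdomain.com > allfiles.list
```

Stripping the prefix with the shell instead of sed avoids having to
escape slashes and dots in the path.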

In my config file I have:
start_url: `/where/ever/you/need/it/allfiles.list`
  (Note: this does not have to be within the htdocs tree)

local_urls: http://www.yourdomain.com=/www/htdocs/
  If you have server-parsed HTML (like PHP), you certainly
  won't want to use local_urls, because the files would be read
  straight from disk, unparsed. Otherwise it speeds things up
  quite a bit.

  You might also want to add
limit_urls_to: ${start_url}
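
Since local_urls has to undo exactly the mapping that produced
allfiles.list, it may be worth checking the list before digging. The
following is only a sketch: check_url_list is a made-up helper, not an
ht://Dig tool, and it simply applies the same prefix swap in reverse
and reports any URL that has no readable file behind it.

```shell
#!/bin/sh
# Sketch: verify that each URL in a list maps back to a readable local file,
# imitating the prefix=path substitution described by local_urls.
# check_url_list is an illustrative helper, not part of ht://Dig.
check_url_list() {
    list=$1
    baseurl=${2%/}
    docroot=${3%/}
    missing=0
    while IFS= read -r url; do
        # Replace the URL prefix with the local directory prefix.
        path=$docroot${url#"$baseurl"}
        if [ ! -r "$path" ]; then
            printf 'no local file for %s\n' "$url" >&2
            missing=1
        fi
    done < "$list"
    # Non-zero exit status if at least one URL could not be mapped.
    return "$missing"
}

# Example call:
# check_url_list /where/ever/you/need/it/allfiles.list \
#     http://www.yourdomain.com /www/htdocs/
```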

Marcel

On 24 May 00, at 16:06, J. op den Brouw wrote:

>
> Are you willing to share this script with me? I need exactly that
> what you wrote.
>
> Thanx in advance.
>
> On Wed, 24 May 2000, Marcel Hicking wrote:
>
> > Since I don't have a document referring to all files
> > to be indexed, I'm thinking of generating a
> > start_url file "on the fly".
> >
> > I have been doing this for a much smaller site:
> > I have set up a little shell script to generate
> > a list with all available files and send it through
> > sed to convert local paths to http://...-URLs.
> > ht://dig is set up with start_url=allfiles.list
> > and a local_urls line to "undo" the above mapping
> > again.
> >
> > Do you think this is appropriate for a larger search
> > or do you have any other suggestions?
> >
> > Marcel
>
> --jesse
> --------------------------------------------------------------------
> J. op den Brouw Johanna Westerdijkplein 75
> Haagse Hogeschool 2521 EN DEN HAAG
> Faculty of Engeneering Netherlands
> Electrical Engeneering +31 70 4458936
> -------------------- J.E.J.opdenBrouw@st.hhs.nl --------------------
>
> Linux - because reboots are for hardware changes
>




This archive was generated by hypermail 2b28 : Fri May 26 2000 - 02:37:47 PDT