Re: [htdig] excluding file trees from indexing process


Subject: Re: [htdig] excluding file trees from indexing process
From: Bill Carlson (wcarlson@vh.org)
Date: Tue Nov 30 1999 - 06:26:35 PST


On Tue, 30 Nov 1999, Torsten Neuer wrote:

> Jens Moellenhoff wrote:
> >
> > Hello,
> >
> > This may be just another one of these newbie questions, but how can I
> > exclude virtual file trees from being indexed? Whenever I enter the
> > keyword "index" in my search form, it returns a lot of hits like
> > "Index of folder1/folder2/folder3/" and shows the folder's index when I
> > click on one of these hits.
> >
> If you need the virtual trees to be walked by the indexer (e.g. in order
> to fetch some non-HTML documents from them), you cannot use the
> exclude_urls directive of Ht://Dig. Since the index is generated
> automatically by your web server, you need to add some indexer control
> information to this auto-generation of index documents.
>
> A portable approach would be to back off from automatic indexing by the
> web server and switch to some server-side scripting (server-parsed HTML,
> PHP, ASP or some CGI) that produces the directory listings (this would
> also allow you to add some design to them). These listings should include
> a proper "robots" meta tag (or be stuffed with Ht://Dig-specific indexer
> control) to control the dig process.
>
> For the Apache web server, you could also hack mod_autoindex to
> include robots control.
>
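
For the scripted-listing route described above, a minimal sketch might look like the following. It assumes Python as the scripting language (the quoted text names PHP, ASP and CGI only as examples) and assumes the indexer honors the standard "noindex, follow" robots values. The function name render_listing and its parameters are hypothetical, chosen for illustration.

```python
import html
import os
from urllib.parse import quote


def render_listing(fs_dir: str, url_path: str) -> str:
    """Return an HTML directory listing that asks robots not to index it.

    fs_dir is the directory on disk, url_path the path shown in the title.
    Both names are illustrative assumptions, not part of any server API.
    """
    rows = []
    for name in sorted(os.listdir(fs_dir)):
        # Mark subdirectories with a trailing slash, as server listings do.
        if os.path.isdir(os.path.join(fs_dir, name)):
            name += "/"
        href = quote(name)         # percent-encode the name for the link
        label = html.escape(name)  # escape the name for display
        rows.append(f'<li><a href="{href}">{label}</a></li>')

    title = html.escape(f"Index of {url_path}")
    return (
        "<html><head>\n"
        # "noindex" keeps the listing itself out of the search index;
        # "follow" still lets the indexer walk the links to the documents.
        '<meta name="robots" content="noindex, follow">\n'
        f"<title>{title}</title>\n"
        "</head><body>\n"
        f"<h1>{title}</h1>\n<ul>\n" + "\n".join(rows) + "\n</ul>\n"
        "</body></html>\n"
    )
```

A CGI wrapper would print a "Content-Type: text/html" header followed by the returned page. The point is only that the listing carries the robots meta tag while its links stay crawlable.
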

Two other approaches:

1) Turn off Fancy Indexing if you don't need it. I would think it is most
helpful to folks editing sites; a production site shouldn't need it.

2) Set up a separate DNS name and virtual host for your site with Fancy
Indexing off. For example, say your site is www.site.org. Set up
htdig.site.org to point to the same IP, copy the Apache setup of
www.site.org to htdig.site.org, and add "FancyIndexing off". You will need
to use the url_part_aliases directive in htdig so that search results
point back at the www.site.org URLs, but it works just fine. I use this
at my site to crawl a local copy of the site rather than load the real
web servers and network when I reindex.
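
The URL translation that url_part_aliases is used for in approach 2 can be pictured as a simple prefix swap. The sketch below is only a toy model in Python of the visible effect. It is not ht://Dig's implementation, and it does not show the directive's configuration syntax. The function name and default arguments are made up for illustration; the host names are the ones from the example above.

```python
def to_public_url(url: str,
                  crawl_prefix: str = "http://htdig.site.org/",
                  public_prefix: str = "http://www.site.org/") -> str:
    """Map a URL seen on the crawl-only host back to the public host."""
    if url.startswith(crawl_prefix):
        # Keep the path, replace only the host part of the URL.
        return public_prefix + url[len(crawl_prefix):]
    # URLs from any other host are left untouched.
    return url
```

With these assumed defaults, a document crawled as http://htdig.site.org/docs/a.html would be reported to searchers as http://www.site.org/docs/a.html.
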

HTH,

Bill Carlson
------------
Systems Programmer bill-carlson@uiowa.edu | Opinions are mine,
Virtual Hospital http://www.vh.org/ | not my employer's.
University of Iowa Hospitals and Clinics |
