Re: [htdig] Indexing scope


Subject: Re: [htdig] Indexing scope
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Mon Apr 17 2000 - 15:31:43 PDT


According to Geoff Hutchison:
> On Sun, 16 Apr 2000, Dave Lers wrote:
> > So the second dig is always adding one hop to the local database_one, that
> > works (I assume the local hops to local files/dirs that were already indexed
> > pose no problems*). Do I have to mess with htdig-dbgen? That file makes just
> > about 0 sense to me.
>
> Yes, as I described it, the second dig is always adding one hop to the
> previous one.
>
> I'm not sure what you're talking about with "htdig-dbgen." It sounds like
> a script provided by a binary package--it's not part of the source
> distribution.

htdig-dbgen is part of the RPM distribution. It's just a symbolic link to rundig,
placed in /etc/cron.daily, so that rundig is automatically run by cron
in the wee hours. It's just there by default so that the htdig RPM will
automatically rebuild the index nightly. It works great for the typical
small site (like mine), but is not a good idea for larger sites, where
you want to schedule nightly updates and less frequent rebuilds, or in
your case where you want to build the database incrementally. You can
always edit /usr/sbin/rundig as you wish, or remove the symbolic link and
replace it with a shell script of your own fabrication.

rundig is just a Bourne shell script, so if it makes 0 sense to you, you
should read up a bit on shell programming before attempting to customise it.
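
As a rough illustration of the "shell script of your own" idea, here is a
minimal sketch of a replacement for the symbolic link that does an update
dig most nights and a full rebuild once a week. The binary and config paths,
the REBUILD_DAY variable, and the function names dig_mode and run_dig are
assumptions made for this example, not something shipped with the RPM. It
relies on htdig's -i option (start from scratch, ignoring the old databases)
and -c option (config file), and it leaves out the htfuzzy and htnotify
steps that the stock rundig also performs.

```shell
#!/bin/sh
# Hypothetical replacement for the /etc/cron.daily/htdig-dbgen symlink.
# All paths below are assumed; adjust them to match your installation.
CONF=${CONF:-/etc/htdig/htdig.conf}
HTDIG=${HTDIG:-/usr/sbin/htdig}
HTMERGE=${HTMERGE:-/usr/sbin/htmerge}
REBUILD_DAY=${REBUILD_DAY:-7}   # 1 = Monday ... 7 = Sunday, as in date +%u

# dig_mode TODAY REBUILD_DAY
# Print "rebuild" when today is the rebuild day, "update" otherwise.
dig_mode() {
    if [ "$1" -eq "$2" ]; then
        echo rebuild
    else
        echo update
    fi
}

# run_dig MODE
# A rebuild passes -i so htdig ignores the old databases; an update
# omits it so the existing databases are reused. htmerge runs only
# if the dig itself succeeded.
run_dig() {
    if [ "$1" = rebuild ]; then
        "$HTDIG" -i -c "$CONF" || return 1
    else
        "$HTDIG" -c "$CONF" || return 1
    fi
    "$HTMERGE" -c "$CONF"
}

# Do nothing (quietly) when the ht://Dig binaries are not installed.
if [ -x "$HTDIG" ] && [ -x "$HTMERGE" ]; then
    run_dig "$(dig_mode "$(date +%u)" "$REBUILD_DAY")"
fi
```

Saved as an executable file in /etc/cron.daily in place of the symbolic
link, something along these lines would be picked up by cron the same way
rundig is now. Compare it against your own copy of rundig before relying
on it, since the stock script may set options this sketch does not.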

> > *How does Htdig handle those foo/?=D type auto indexes (an Apache thing?)?
> > Watching dig I seem to remember a long run of *'s (I ran one search script that
> > indexed these as separate URL's)
>
> Sigh. If you have Apache's FancyIndexing turned on, you'll get links at
> the top. Since these are links to "new pages" you'll get essentially
> duplicate copies of these indexes, though the pages linked from them
> aren't affected.
>
> I usually add "?" to exclude_urls to get rid of these. There's not much
> the indexer can do since they really are different pages.

If you need to index any CGI scripts with URL parameters, and therefore
can't exclude all URLs containing a "?", you can add more specific patterns
to exclude_urls to exclude the duplicate index pages. E.g.:

exclude_urls: ?D=A ?D=D ?M=A ?M=D ?N=A ?N=D ?S=A ?S=D
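
To see which URLs a pattern list like that would catch, here is a small
shell sketch that imitates the matching, on the understanding that
exclude_urls rejects any URL containing one of the patterns as a plain
substring. The function name is_excluded and the example.com URLs are made
up for the illustration; this is not how htdig itself is implemented.

```shell
#!/bin/sh
# is_excluded URL
# Succeed if URL contains any of the patterns from the exclude_urls line
# above as a literal substring. The patterns are quoted so that "?" is
# treated as an ordinary character rather than a shell wildcard.
is_excluded() {
    for pat in '?D=A' '?D=D' '?M=A' '?M=D' '?N=A' '?N=D' '?S=A' '?S=D'; do
        case $1 in
            *"$pat"*) return 0 ;;
        esac
    done
    return 1
}

# Hypothetical URLs: an Apache sorted index and a CGI query.
for url in 'http://www.example.com/docs/?D=A' \
           'http://www.example.com/cgi-bin/search?q=foo'; do
    if is_excluded "$url"; then
        echo "skip  $url"
    else
        echo "index $url"
    fi
done
```

The sorted directory listing is skipped, while the CGI URL with its own
"?" parameters is still indexed, which is the point of using the more
specific patterns instead of a bare "?".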

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



