Re: [htdig] Problems with cgis and collecting other sites


Subject: Re: [htdig] Problems with cgis and collecting other sites
From: Geoff Hutchison (ghutchis@wso.williams.edu)
Date: Thu Feb 10 2000 - 12:57:55 PST


On Thu, 10 Feb 2000, Walter Addison March wrote:

> but we have things like http://haverford.edu/acc/WebX that are cgis also.
> Is there some flag I am missing to tell htdig not to pick up cgis no matter
> what they might be named or does one have to figure out all the various
> cgis that we run for the several servers on campus and add each one to
> exclude_urls?

If you don't want to index CGIs, then yes, you'll have to add each one to
exclude_urls. If you showed me that URL, I'd have no way of knowing a priori
that it was a CGI, and the same is true for htdig when indexing. Of course,
you don't have to ignore CGIs--many people include them in their databases.

> was in the URL, is there a way to restrict htdig by IPs or something so
> that it doesn't follow links like that? Or, if there is a way to exclude
> cgis not based on their urls, would that work for this?

I'd use ? as a pattern in exclude_urls, since that is a common way
to pass data to a CGI.
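
As a rough illustration of why a pattern like ? helps while a URL such as
/acc/WebX still slips through, here is a small Python sketch. It is not
htdig's actual code: it assumes exclude_urls patterns are matched as plain
substrings of the URL, and the pattern list and the second and third URLs
are made up for the example.

```python
# Illustrative sketch only; NOT htdig's implementation.
# Assumption: a URL is rejected if it contains any exclude pattern as a substring.

def is_excluded(url, patterns):
    """Return True if any pattern occurs anywhere in the URL."""
    return any(p in url for p in patterns)

# Hypothetical pattern list: two common CGI markers plus "?".
patterns = ["/cgi-bin/", ".cgi", "?"]

urls = [
    "http://haverford.edu/acc/WebX",            # a CGI, but nothing in the URL says so
    "http://haverford.edu/acc/WebX?14@@",       # hypothetical query-string form
    "http://haverford.edu/cgi-bin/search.cgi",  # hypothetical conventional CGI path
]

# Keep only the URLs that match none of the patterns.
kept = [u for u in urls if not is_excluded(u, patterns)]
print(kept)
```

Under that assumption, only the bare /acc/WebX URL is kept: it carries no
marker to match on, so it would still need its own entry in exclude_urls.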

> One last point, I did try adding /ugweb.cs.ualberta.ca/ to the exclude_urls
> and then ran an update... but that info is still there... is the
> information still there because I ran an update?

Correct. An update will not delete URLs from a database, period. So if you
want to get the URL out, currently you'll need to rebuild the databases
from scratch.

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/




This archive was generated by hypermail 2b28 : Thu Feb 10 2000 - 13:00:28 PST