Re: [htdig3-dev] Summary and patch for robots.txt


Subject: Re: [htdig3-dev] Summary and patch for robots.txt
From: Geoff Hutchison (ghutchis@wso.williams.edu)
Date: Tue Feb 08 2000 - 15:10:48 PST


At 9:55 PM +0200 2/8/00, Valdas Andrulis wrote:
>So here is the fix (I think the code was meant to work this way; it is a
>common if/else error):
>
>--- htlib/HtRegex.cc.old Tue Feb 8 21:31:40 2000
>+++ htlib/HtRegex.cc Tue Feb 8 21:32:21 2000
>@@ -39,11 +39,15 @@
> if (str == NULL) return;
> if (strlen(str) <= 0) return;
> if (!case_sensitive)
>+ {
> if (regcomp(&re, str, REG_EXTENDED|REG_ICASE) == 0)
> compiled = 1;
>+ }
> else
>+ {
> if (regcomp(&re, str, REG_EXTENDED) == 0)
> compiled = 1;
>+ }
> }
>
> void

Whoops! This is a good bug-fix. The bug has probably been causing a
number of problems with things like exclude_urls and limit_urls_to as
well.
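
For anyone skimming the patch, the issue is the classic dangling else: without braces, the `else` pairs with the nearest `if` (the regcomp check) rather than with `if (!case_sensitive)`. The sketch below reproduces the two shapes side by side. The function names compile_unbraced and compile_braced are made up for illustration and are not part of HtRegex; only the POSIX regcomp/regfree calls are taken from the patch context.

```cpp
#include <regex.h>

// Hypothetical stand-in for the pre-patch logic. The indentation suggests
// the else belongs to `if (!case_sensitive)`, but the compiler attaches it
// to the inner `if (regcomp(...) == 0)`. Compilers may warn about this.
bool compile_unbraced(const char *str, bool case_sensitive)
{
    regex_t re;
    bool compiled = false;
    if (!case_sensitive)
        if (regcomp(&re, str, REG_EXTENDED | REG_ICASE) == 0)
            compiled = true;
    else
        if (regcomp(&re, str, REG_EXTENDED) == 0)
            compiled = true;
    if (compiled)
        regfree(&re);   // release the pattern; we only report success
    return compiled;
}

// Hypothetical stand-in for the patched logic: braces make the else pair
// with the outer if, so the case-sensitive branch is actually reached.
bool compile_braced(const char *str, bool case_sensitive)
{
    regex_t re;
    bool compiled = false;
    if (!case_sensitive) {
        if (regcomp(&re, str, REG_EXTENDED | REG_ICASE) == 0)
            compiled = true;
    } else {
        if (regcomp(&re, str, REG_EXTENDED) == 0)
            compiled = true;
    }
    if (compiled)
        regfree(&re);
    return compiled;
}
```

In the unbraced version a case-sensitive pattern is never compiled at all, because when case_sensitive is true the outer condition is false and there is no else attached to it.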

As for the robots.txt, I think we want to stick to the first matching
section. Any matching section overrides the *, but I think Gilles's
code is what we want.

i.e.

User-agent: *
Disallow: *
User-agent: htdig
Disallow: /cgi-bin/
Disallow: /cat/

I think this is the typical (and expected) format. If Loic's search
turns up some interesting examples of other formats, we may want to
consider a more liberal parser. I think we probably want to consider
an Allow section, but it would be a bit tricky.
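
To make the "first matching section, otherwise *" rule concrete, here is a minimal sketch of the selection logic. It is not Gilles's code or the actual ht://Dig parser. The function disallowed_for and its helpers are hypothetical, and it assumes that every User-agent line starts a new section, that agent names are compared as case-insensitive substrings of the robot's name, and that Allow lines are simply ignored.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: lowercase copy of a string.
static std::string lower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Hypothetical helper: strip leading and trailing blanks.
static std::string trim(const std::string &s)
{
    const char *ws = " \t\r";
    std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos)
        return "";
    return s.substr(b, s.find_last_not_of(ws) - b + 1);
}

// Returns the Disallow values from the first section that names `robot`,
// or from the first `*` section when no section names it.
std::vector<std::string> disallowed_for(const std::string &robots_txt,
                                        const std::string &robot)
{
    std::vector<std::string> specific, fallback;
    bool found_specific = false, found_star = false;
    enum { SKIP, SPECIFIC, STAR } target = SKIP;

    std::istringstream in(robots_txt);
    std::string line;
    while (std::getline(in, line)) {
        line = line.substr(0, line.find('#'));       // drop comments
        std::size_t colon = line.find(':');
        if (colon == std::string::npos)
            continue;
        std::string field = lower(trim(line.substr(0, colon)));
        std::string value = trim(line.substr(colon + 1));

        if (field == "user-agent") {
            target = SKIP;                           // later matches are ignored
            if (value == "*") {
                if (!found_star) { target = STAR; found_star = true; }
            } else if (!value.empty() && !found_specific &&
                       lower(robot).find(lower(value)) != std::string::npos) {
                target = SPECIFIC;
                found_specific = true;
            }
        } else if (field == "disallow") {
            if (target == SPECIFIC) specific.push_back(value);
            else if (target == STAR) fallback.push_back(value);
        }
    }
    return found_specific ? specific : fallback;
}
```

With the example above, a robot named htdig would get /cgi-bin/ and /cat/ only, while any other robot would fall back to the * section. Supporting Allow would mean keeping rule order and deciding precedence between overlapping Allow and Disallow paths, which is the tricky part.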

-Geoff

P.S. I'm currently quite swamped, so I will probably not be
responding to much discussion--I don't want to rush off a response
and stick my foot in my mouth!



