Re: [htdig3-dev] Re: robots.txt bug (was [ANNOUNCE] ht://Dig


Subject: Re: [htdig3-dev] Re: robots.txt bug (was [ANNOUNCE] ht://Dig
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Tue Feb 08 2000 - 11:13:38 PST


According to Geoff Hutchison:
> On Tue, 8 Feb 2000 loic@ceic.com wrote:
>
> > > given that rest is char * and not const char *? What will (String) =
> > > (char *) default to?
> >
> > No it's not. The opposite would be, barking when (String) = (const char*).
>
> Uhm. I hope I'm not the only one confused by your statement Loic. Are you
> saying that this code should be OK?

I'm pretty sure he meant the code is OK at that point.

> If so, someone else please take a look at that section of Server.cc. I
> don't see why it would present "/foobar/" instead of "/cat/|/foobar/"
> unless something is fouling up that conditional.

Found it!

>Trying to retrieve robots.txt file
>Parsing robots.txt file using myname = htdig
>Found 'user-agent' line: htdig
>Found 'disallow' line: /cat/
>Found 'user-agent' line: htdig
>Found 'disallow' line: /foobar/
>Pattern: /foobar/

The problem is the second User-agent line, which causes the previous
pattern to be cleared. I don't know whether the robots.txt file or the
parsing is incorrect, but one or the other has to change. The code expects
that a line of "User-agent: htdig" will begin a new User-agent section
which will override any previous User-agent sections. If it's correct
form to have multiple User-agent sections for a given User-agent, then
the code is wrong. If the standard requires that all Disallow entries
for one User-agent fall under a single User-agent heading, then the
file above is incorrect.
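
For anyone who wants to reproduce this outside htdig, here is a minimal
C++ sketch of the behaviour described above. It is not the Server.cc
source: pattern_with_reset, the (field, value) line representation and
the exact-match name test are all assumptions made for illustration.
Only pay_attention, pattern and the '|' separator come from the real code
and debug output.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One parsed robots.txt line: (lower-cased field name, value).
// Hypothetical representation, not the one Server.cc uses.
using Line = std::pair<std::string, std::string>;

// Mimics the behaviour described above: every User-agent line that names
// this robot starts a new section and throws away the pattern built so far.
std::string pattern_with_reset(const std::vector<Line>& lines,
                               const std::string& myname)
{
    assert(!myname.empty());
    bool pay_attention = false;
    std::string pattern;
    for (const Line& line : lines) {
        if (line.first == "user-agent") {
            pay_attention = (line.second == myname);
            if (pay_attention)
                pattern.clear();        // earlier Disallow entries are lost here
        } else if (line.first == "disallow" && pay_attention) {
            if (!pattern.empty())
                pattern += '|';         // entries are joined with '|'
            pattern += line.second;
        }
    }
    return pattern;
}
```

Fed the four lines from the debug output, this keeps only the entry from
the second section, which matches the "Pattern: /foobar/" line in the log.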

The old standard was vague on this point, but its examples never showed
more than one User-agent field bearing the same name. It says the robot
"should be liberal in interpreting this field." But it also says, with
regard to the "User-agent: *" record, that it is "not allowed to have
multiple such records".

According to the new draft standard, it would appear that both the file
above and the current code are incorrect: the robot should obey only the
FIRST matching section...

3.2.1 The User-agent line

   Name tokens are used to allow robots to identify themselves via a
   simple product token. Name tokens should be short and to the
   point. The name token a robot chooses for itself should be sent
   as part of the HTTP User-agent header, and must be well documented.

   These name tokens are used in User-agent lines in /robots.txt to
   identify to which specific robots the record applies. The robot
   must obey the first record in /robots.txt that contains a User-
   Agent line whose value contains the name token of the robot as a
   substring. The name comparisons are case-insensitive. If no such
   record exists, it should obey the first record with a User-agent
   line with a "*" value, if present. If no record satisfied either
   condition, or no records are present at all, access is unlimited.
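
Read literally, that paragraph amounts to a first-match lookup over
records. Here is a rough C++ sketch of my reading of it, not anything
taken from htdig: the Record struct and select_record are hypothetical
names, and it assumes the file has already been split into records.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical pre-parsed record: one or more User-agent values followed
// by that record's Disallow values.
struct Record {
    std::vector<std::string> agents;
    std::vector<std::string> disallow;
};

// ASCII lower-casing, for the case-insensitive comparison.
static std::string lower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Returns the record the robot must obey, or nullptr when access is
// unlimited (no record names the robot and there is no "*" record).
const Record* select_record(const std::vector<Record>& records,
                            const std::string& name_token)
{
    assert(!name_token.empty());    // an empty token would match every record
    const std::string token = lower(name_token);
    const Record* wildcard = nullptr;
    for (const Record& r : records) {
        for (const std::string& agent : r.agents) {
            // First record whose User-agent value contains the token wins.
            if (lower(agent).find(token) != std::string::npos)
                return &r;
            // Remember only the first "*" record as the fallback.
            if (agent == "*" && wildcard == nullptr)
                wildcard = &r;
        }
    }
    return wildcard;
}
```

Note that a record naming the robot takes precedence over a "*" record
even when the "*" record comes first in the file, and that a second
record naming the same robot is never consulted.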

To implement this we should do the following when the name matches:

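                if (!seen_myname)
                {
                    // First section naming this robot: start a fresh pattern.
                    seen_myname = 1;
                    pay_attention = 1;
                    pattern = 0;
                }
                else
                    pay_attention = 0;  // Ignore any later section for this robot.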

If we don't want a rigorous implementation of the draft (which also
defines use of the Allow record) but want a more liberal interpretation
instead, we can leave off the else clause, and the parser will continue
to accept lines from an adjacent User-agent section. To be even more
liberal, and not even require that the sections be adjacent, we should
always set pay_attention to 1 when the name matches, but only clear the
pattern when the name is first seen.
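
To make the three options concrete, here is a self-contained C++ sketch
that runs each one over the same input. It is only a model of the parser
loop, not the Server.cc code: the Policy enum, build_pattern and the
(field, value) line pairs are names invented for illustration, and the
name test is a plain exact match. The seen_myname, pay_attention and
pattern variables play the same roles as in the fragment above.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names for the three options discussed above.
enum class Policy {
    Strict,    // draft behaviour: obey only the first matching section
    Adjacent,  // no else clause: also accept an adjacent matching section
    Liberal    // accept every matching section, adjacent or not
};

// One parsed robots.txt line: (lower-cased field name, value).
using Line = std::pair<std::string, std::string>;

std::string build_pattern(const std::vector<Line>& lines,
                          const std::string& myname, Policy policy)
{
    assert(!myname.empty());
    bool seen_myname = false;
    bool pay_attention = false;
    std::string pattern;
    for (const Line& line : lines) {
        if (line.first == "user-agent") {
            if (line.second != myname) {
                pay_attention = false;      // someone else's section
            } else if (!seen_myname) {
                seen_myname = true;
                pay_attention = true;
                pattern.clear();            // cleared once, on first sight
            } else if (policy == Policy::Strict) {
                pay_attention = false;      // the else clause
            } else if (policy == Policy::Liberal) {
                pay_attention = true;       // always listen when named
            }
            // Policy::Adjacent leaves pay_attention as it was.
        } else if (line.first == "disallow" && pay_attention) {
            if (!pattern.empty())
                pattern += '|';
            pattern += line.second;
        }
    }
    return pattern;
}
```

With the two adjacent htdig sections from the file above, Strict gives
"/cat/" while Adjacent and Liberal both give "/cat/|/foobar/". If a
section for some other robot sits between the two, only Liberal still
picks up "/foobar/".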

The docs on http://info.webcrawler.com/mak/projects/robots/robots.html
suggest that the draft specification "is not yet completed or implemented,"
so I don't know how rigorously we'd want to enforce it.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930




This archive was generated by hypermail 2b28 : Tue Feb 08 2000 - 11:16:07 PST