Subject: Re: [htdig3-dev] Re: robots.txt bug (was [ANNOUNCE] ht://Dig
From: Gilles Detillieux (email@example.com)
Date: Tue Feb 08 2000 - 11:13:38 PST
According to Geoff Hutchison:
> On Tue, 8 Feb 2000 firstname.lastname@example.org wrote:
> > > given that rest is char * and not const char *? What will (String) =
> > > (char *) default to?
> > No it's not. The opposite would be, barking when (String) = (const char*).
> Uhm. I hope I'm not the only one confused by your statement Loic. Are you
> saying that this code should be OK?
I'm pretty sure he meant the code is OK at that point.
> If so, someone else please take a look at that section of Server.cc. I
> don't see why it would present "/foobar/" instead of "/cat/|/foobar/"
> unless something is fouling up that conditional.
>Trying to retrieve robots.txt file
>Parsing robots.txt file using myname = htdig
>Found 'user-agent' line: htdig
>Found 'disallow' line: /cat/
>Found 'user-agent' line: htdig
>Found 'disallow' line: /foobar/
The problem is the second User-agent line, which causes the previous
pattern to be cleared. I don't know whether the robots.txt file or the
parsing is incorrect, but one or the other has to change. The code expects
that a line of "User-agent: htdig" will begin a new User-agent section
which will override any previous User-agent sections. If it's correct
form to have multiple User-agent sections for a given User-agent, then
the code is wrong. If the standard requires that all Disallow entries
for one User-agent fall under a single User-agent heading, then the
file above is incorrect.
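To make the symptom concrete, here is a minimal sketch of the behaviour described above. It is not the actual Server.cc code: the function name, the (field, value) line representation, and the exact-match comparison on the robot name are all assumptions made for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the parser; each input line is a
// (field, value) pair that has already been lower-cased and trimmed.
using Line = std::pair<std::string, std::string>;

std::string current_behaviour(const std::vector<Line>& lines,
                              const std::string& myname)
{
    assert(!myname.empty());
    std::string pattern;          // disallow entries joined with '|'
    bool pay_attention = false;

    for (const Line& line : lines) {
        if (line.first == "user-agent") {
            if (line.second == myname) {   // assumption: exact match
                pay_attention = true;
                pattern.clear();  // every matching section starts from scratch
            } else {
                pay_attention = false;
            }
        } else if (line.first == "disallow" && pay_attention) {
            if (!pattern.empty())
                pattern += '|';
            pattern += line.second;
        }
    }
    return pattern;
}
```

Fed the four lines from the log above, this sketch ends up with "/foobar/" only, because the second "User-agent: htdig" line clears the "/cat/" entry collected so far.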
The old standard was vague on this point, but the examples never showed
more than one User-agent field bearing the same name. It says the robot
"should be liberal in interpreting this field." But it also says, with
regard to the "User-agent: *" record, that it is "not allowed to have
multiple such records".
According to the new draft standard, it would appear that both the file
above and the current code are incorrect - it should only use the FIRST
3.2.1 The User-agent line
Name tokens are used to allow robots to identify themselves via a
simple product token. Name tokens should be short and to the
point. The name token a robot chooses for itself should be sent
as part of the HTTP User-agent header, and must be well documented.
These name tokens are used in User-agent lines in /robots.txt to
identify to which specific robots the record applies. The robot
must obey the first record in /robots.txt that contains a User-
Agent line whose value contains the name token of the robot as a
substring. The name comparisons are case-insensitive. If no such
record exists, it should obey the first record with a User-agent
line with a "*" value, if present. If no record satisfied either
condition, or no records are present at all, access is unlimited.
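Read literally, the paragraph quoted above selects exactly one record. A rough sketch of that selection rule follows, assuming the file has already been split into records. The Record struct and the function names are hypothetical and do not come from htdig.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical record type: the User-agent values heading one record,
// plus the Disallow entries that follow them.
struct Record {
    std::vector<std::string> agents;
    std::vector<std::string> disallow;
};

static std::string to_lower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Returns the record the robot should obey, or nullptr when access
// is unlimited (no matching record and no "*" record).
const Record* select_record(const std::vector<Record>& records,
                            const std::string& name_token)
{
    assert(!name_token.empty());
    const std::string token = to_lower(name_token);
    const Record* wildcard = nullptr;

    for (const Record& rec : records) {
        for (const std::string& agent : rec.agents) {
            // Case-insensitive substring test, as the draft describes.
            if (to_lower(agent).find(token) != std::string::npos)
                return &rec;          // first record naming this robot wins
            if (agent == "*" && wildcard == nullptr)
                wildcard = &rec;      // remember the first "*" record
        }
    }
    return wildcard;
}
```

Under this reading, a later record that also names the robot is never consulted, which is why both the file above and the current parsing would disagree with the draft.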
To implement this we should do the following when the name matches:
    if (!seen_myname) {
        seen_myname = 1;
        pay_attention = 1;
        pattern = 0;
    } else
        pay_attention = 0;
If we don't want a rigorous implementation of the draft, which also
defines use of the Allow record, but want instead a more liberal
interpretation, we can leave off the else clause, and it will continue
to accept lines from an adjacent User-agent section. To be even more
liberal, and not even require that the sections be adjacent, we should
always set pay_attention to 1 when the name matches, but only clear the
pattern when the name is first seen.
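The three options (rigorous, liberal for adjacent sections, liberal for any section) can be compared side by side in a small sketch. The Policy enum, the function name, and the line representation are assumptions for illustration. So is the behaviour of clearing pay_attention when another robot's User-agent line is seen. Handling of "*" records and of Allow lines is left out.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

enum class Policy { Strict, Adjacent, Liberal };    // hypothetical names

using Line = std::pair<std::string, std::string>;   // (field, value), lower-cased

std::string collect_disallows(const std::vector<Line>& lines,
                              const std::string& myname, Policy policy)
{
    assert(!myname.empty());
    std::string pattern;          // disallow entries joined with '|'
    bool seen_myname = false;
    bool pay_attention = false;

    for (const Line& line : lines) {
        if (line.first == "user-agent") {
            if (line.second.find(myname) != std::string::npos) {
                if (!seen_myname) {
                    seen_myname = true;
                    pay_attention = true;
                    pattern.clear();        // only cleared on first sight
                } else if (policy == Policy::Strict) {
                    pay_attention = false;  // the else clause: first record only
                } else if (policy == Policy::Liberal) {
                    pay_attention = true;   // accept any later section too
                }
                // Policy::Adjacent: no else clause, state is left untouched
            } else {
                pay_attention = false;      // assumed: another robot's section
            }
        } else if (line.first == "disallow" && pay_attention) {
            if (!pattern.empty())
                pattern += '|';
            pattern += line.second;
        }
    }
    return pattern;
}
```

For the file in the log above, Strict yields "/cat/" while Adjacent and Liberal both yield "/cat/|/foobar/". If another robot's section sits between the two htdig sections, only Liberal still picks up "/foobar/".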
The docs on http://info.webcrawler.com/mak/projects/robots/robots.html
suggest that the draft specification "is not yet completed or implemented,"
so I don't know how rigorously we'd want to enforce it.
-- 
Gilles R. Detillieux              E-mail: <email@example.com>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930