htdig: robots.txt and case-sensitivity


Tobias Brasier (tobias@sc.edu)
Fri, 06 Feb 1998 09:27:56 -0500


htdig follows links over HTTP to the web sites it wants to index. URL paths
are case-sensitive, regardless of the OS the web server runs on. Before
indexing a server, htdig first reads the robots.txt file located at the root
level of the primary document directory.

For example:
htdig is indexing servers within the somesite.org domain. It begins at
http://www.somesite.org/index.html and follows a link to the target page
at http://web.somesite.org/FRED/fred.html. In this example, the latter
server is a Windows NT machine (a case-insensitive OS) running Netscape
Enterprise Server 3.x.

The robot exclusion file at web.somesite.org/robots.txt contains the
following line:
     Disallow: /FRED/

The target page can be reached at
http://web.somesite.org/FRED/fred.html, at
http://web.somesite.org/FRed/FrEd.hTMl, at
http://web.somesite.org/fred/FRED.html, or at any other case variation.
As far as I can tell, htdig will NOT index
http://web.somesite.org/FRED/fred.html (the link it is following), because
that path matches the Disallow line exactly, but it CAN STILL index
http://web.somesite.org/FRed/FrEd.hTMl,
http://web.somesite.org/fred/FRED.html, or any other case variation,
because those paths do not match.
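
To make the behavior I am describing concrete, here is a minimal Python
sketch. It assumes the robot does a plain case-sensitive prefix comparison
between the URL path and each Disallow line; that is my reading of the
behavior, not htdig's actual code, and the function name is_disallowed is
made up for illustration.

```python
from urllib.parse import urlsplit

# Disallow line taken from the example robots.txt above.
DISALLOW_RULES = ["/FRED/"]


def is_disallowed(url, rules=DISALLOW_RULES):
    """Return True if the URL path starts with any Disallow prefix.

    Assumption: the comparison is case-sensitive, as a robot that treats
    URL paths literally would do it.
    """
    path = urlsplit(url).path
    return any(path.startswith(rule) for rule in rules)


if __name__ == "__main__":
    for url in (
        "http://web.somesite.org/FRED/fred.html",
        "http://web.somesite.org/FRed/FrEd.hTMl",
        "http://web.somesite.org/fred/FRED.html",
    ):
        print(url, "->", "blocked" if is_disallowed(url) else "allowed")
```

Under that assumption only the first URL is blocked, even though all three
return the same page from the case-insensitive server.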

My questions:
- Is there a configuration in htdig that I am missing that would handle
this problem?
- Can a web server be configured to present URLs in a certain case? I
haven't found it in the web server software in question.
- How can the webmaster of web.somesite.org reliably disallow indexing of a
directory? By listing every case variation of the directory name in
robots.txt?
- Would htdig index the target page only if it finds a link that uses one
of those other case variations?
- Does any of the preceding make sense? Have I made any incorrect assumptions?
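
To show what listing every case variation would involve, here is a rough
Python sketch that generates the Disallow lines for one directory name. The
helper case_variants is hypothetical, and the sketch assumes each letter can
independently appear in upper or lower case.

```python
from itertools import product


def case_variants(name):
    """Return every upper/lower-case spelling of name, sorted."""
    # Each letter contributes two choices; other characters contribute one.
    choices = [
        (ch.lower(), ch.upper()) if ch.isalpha() else (ch,)
        for ch in name
    ]
    return sorted({"".join(combo) for combo in product(*choices)})


if __name__ == "__main__":
    variants = case_variants("fred")
    print(len(variants))  # 16, i.e. 2 ** 4 for a four-letter name
    for variant in variants:
        print(f"Disallow: /{variant}/")
```

A four-letter directory name already needs 16 lines, and the count doubles
with every additional letter, so this approach gets out of hand quickly for
longer names.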

Thank you all very much.

  Tobias A. Brasier
  Webmaster - The University of South Carolina
  Internet Solutions Group - Division of Libraries & Information Systems
  1244 Blossom Street, Columbia, South Carolina 29208
  voice: (803) 777-5211 | fax: (803) 777-4149
  mailto:tobias@sc.edu | http://mel.csd.sc.edu/~tobias/



