Re: [htdig] how to ignore robots.txt


Geoff Hutchison (ghutchis@wso.williams.edu)
Sun, 28 Mar 1999 23:16:44 -0500 (EST)


On Sun, 28 Mar 1999, p0222 wrote:

> How can I tell htdig to *ignore* the robots.txt-files, on the whole web or
> on specified servers ?
> That's my problem:
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^EXLCUDE LIST ?!?
> How can i turn this exlcude list *OFF* ?!?

No, not quite. First off, you cannot turn off the robots.txt parsing. It's
a standard and if you have a problem with a server's robots.txt file, you
should really take it up with the webmaster.

But robots.txt is not your problem here. The default config file ships
with the option:

    exclude_urls: cgi-bin .cgi

So this option, not robots.txt, is excluding the URLs you mention. If you
don't want this, remove it. (One caveat: currently, if you make
exclude_urls empty, it will ignore *all* URLs. So instead, set it to
something that cannot occur in a URL, like !-no-url-!, and it won't
exclude anything on the servers it indexes.)
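
As a rough sketch of that workaround, the relevant part of the config
file might look something like this. The commented-out line is the
shipped default quoted above; the placeholder is just an arbitrary
string that should never appear in a real URL, and the exact file name
and location depend on your installation:

```
# Shipped default, which skips any URL containing "cgi-bin" or ".cgi":
# exclude_urls: cgi-bin .cgi

# Placeholder that cannot match a real URL, so nothing is excluded.
# Do not leave the value empty (see the caveat above).
exclude_urls: !-no-url-!
```

This only affects the exclude list. Rules in a server's robots.txt are
still honored either way.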

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/



