Re: [htdig] robots.txt results in not indexing a whole site?


From: Geoff Hutchison (ghutchis@wso.williams.edu)
Date: Thu Aug 17 2000 - 17:14:04 PDT


At 4:52 PM -0700 8/17/00, boerio@arocknid.com wrote:
> Rejected: Item in the exclude list: item # 3 length: 1
>
> url rejected: (level 1)http://www.DOMAIN.com/index.html
>
>My problem is likely in this "exclude list" but I don't know where that's
>coming from. There's nothing in the htdig.conf file that would indicate
>such a list, and I don't think I'm intentionally doing anything.

A URL can be rejected for several reasons. The limit_urls_to and
exclude_urls attributes in your htdig.conf can each rule it out, and
so can the robots.txt file you mentioned in the subject of your
message. With this much debugging turned on, the robots.txt patterns
are included in the output: htdig prints them when it first starts
indexing the server.
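
To make the three checks concrete, here is a rough Python sketch of
how a URL could fall through each of them in turn. It is not htdig's
code. The function name rejection_reason, the sample pattern lists,
the "htdig" agent name and the plain substring matching are all
assumptions made for illustration, and Python's robots.txt parser may
not behave exactly like htdig's.

```python
from urllib.robotparser import RobotFileParser


def rejection_reason(url, limit_urls_to, exclude_urls, robots_lines, agent="htdig"):
    """Return why a URL would be skipped, or None if it passes all three checks.

    Hypothetical helper: substring matching is an assumption, not htdig's code.
    """
    # limit_urls_to: the URL must contain at least one of these substrings.
    if not any(pattern in url for pattern in limit_urls_to):
        return "not in limit_urls_to"
    # exclude_urls: the URL must not contain any of these substrings.
    for pattern in exclude_urls:
        if pattern in url:
            return "matches exclude_urls pattern %r" % pattern
    # robots.txt: parsed from a list of lines rather than fetched over the network.
    parser = RobotFileParser()
    parser.parse(robots_lines)
    if not parser.can_fetch(agent, url):
        return "disallowed by robots.txt"
    return None


url = "http://www.DOMAIN.com/index.html"
limits = ["http://www.DOMAIN.com/"]   # example value only
excludes = ["/cgi-bin/", ".cgi"]      # example values only

# A blanket "Disallow: /" blocks every URL on the server.
blanket = ["User-agent: *", "Disallow: /"]
print(rejection_reason(url, limits, excludes, blanket))

# A narrower rule leaves index.html crawlable.
narrow = ["User-agent: *", "Disallow: /private/"]
print(rejection_reason(url, limits, excludes, narrow))
```

The point of the two robots.txt samples is that a single
"Disallow: /" line is enough to keep a crawler off the whole site,
while a rule for one directory is not.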

It's hard to say more since you haven't given a concrete example or
the relevant sections of your htdig.conf or robots.txt files.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/



