Re: [htdig] robots.txt results in not indexing a whole site?


Subject: Re: [htdig] robots.txt results in not indexing a whole site?
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Fri Aug 18 2000 - 08:24:53 PDT


According to boerio@arocknid.com:
> I'm using ht://Dig 3.1.4 on a Linux platform, and noticed that only one
> single site from some of my URL entries were getting indexed. I turned on
> all the debugging information, and this appears throughout:
>
> Rejected: Item in the exclude list: item # 3 length: 1

This error message refers to the third item in the exclude_urls attribute.
Unfortunately, there is no clear documentation explaining which error
messages correspond to which config attributes, and the error messages
themselves are not clear enough, so the only sure way to track down
some of these errors right now is to search for the messages in the
source code.
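
To make the "item # 3 length: 1" wording more concrete, here is a small
Python sketch of how a message like that could arise. It is only an
illustration, not ht://Dig's actual matching code: the function name, the
simple substring test, and the sample exclude list (with a stray
one-character third item) are all assumptions made for the example.

```python
def find_excluding_item(url, exclude_items):
    """Return (item_number, item_length) for the first exclude item that
    occurs as a substring of url, or None if no item matches.

    item_number is 1-based, so "item # 3" means the third item in the list.
    """
    for number, item in enumerate(exclude_items, start=1):
        if item and item in url:
            return number, len(item)
    return None


# Hypothetical exclude list: the third item is a stray one-character pattern.
exclude_items = "/cgi-bin/ .cgi .".split()

# prints (3, 1): the third item, which is one character long, matched
print(find_excluding_item("http://www.DOMAIN.com/index.html", exclude_items))
```

With a list like this, a one-character third item would match nearly every
URL, which would be consistent with most of a site being rejected.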

> url rejected: (level 1)http://www.DOMAIN.com/index.html

Again, an unclear message. Level 1 refers to the first round of
tests the URL must pass, based on bad_extensions, valid_extensions,
accepted protocol (http only for 3.1.x), exclude_urls, limit_urls_to, and
bad_querystr. This message appears at verbosity of 2 (-vv) or greater.
You need verbosity of at least 3 (-vvv) to get a better explanation,
which you did get in the earlier message above.

This is unrelated to robots.txt, which is checked at a later stage and
produces its own message about the URL being discarded, i.e.:

        robots.txt: discarding http://whatever...
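
If the verbose output is long, a short script can help separate these
cases. The following Python sketch sorts lines of saved output by the
message prefixes quoted in this thread. The function names and category
labels are made up for the example, and the prefixes are assumed to appear
exactly as shown here; other versions may word them differently.

```python
def classify_line(line):
    """Map one line of verbose output to a rejection category, or None."""
    text = line.strip()
    if text.startswith("Rejected: Item in the exclude list"):
        return "exclude_list"
    if text.startswith("url rejected:"):
        return "url_rejected"
    if text.startswith("robots.txt: discarding"):
        return "robots_txt"
    return None


def summarize(lines):
    """Count how many lines fall into each rejection category."""
    counts = {}
    for line in lines:
        kind = classify_line(line)
        if kind is not None:
            counts[kind] = counts.get(kind, 0) + 1
    return counts


sample = [
    "Rejected: Item in the exclude list: item # 3 length: 1",
    "url rejected: (level 1)http://www.DOMAIN.com/index.html",
    "        robots.txt: discarding http://whatever...",
]
print(summarize(sample))
```

If the "robots_txt" count is zero while the other two are large, robots.txt
is probably not what is keeping the site out of the index.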

> My problem is likely in this "exclude list" but I don't know where that's
> coming from. There's nothing in the htdig.conf file that would indicate
> such a list, and I don't think I'm intentionally doing anything.

The htdig.conf file doesn't come anywhere close to including all possible
configuration attributes. There are tons of them, and they all have
compiled-in defaults, so just because an attribute isn't in htdig.conf
doesn't mean it's not set or used. You need to check the documentation
for the default settings of attributes not in your config file.
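
As a rough picture of how defaults and the config file combine, here is a
Python sketch that overlays "name: value" lines from a config file on top
of a table of defaults. It is a simplification, not ht://Dig's parser: the
function names are invented, the parsing rules (comment lines starting with
"#", a trailing backslash continuing a line) are assumed, and the default
shown for exclude_urls is a placeholder to be checked against the
attribute documentation.

```python
def parse_config(text):
    """Parse simple 'name: value' lines into a dict.

    Lines starting with '#' are skipped; a trailing backslash joins the
    next line onto the current one.
    """
    settings = {}
    pending = ""
    for raw in text.splitlines():
        line = pending + raw.strip()
        pending = ""
        if line.endswith("\\"):
            pending = line[:-1].rstrip() + " "
            continue
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition(":")
        if sep:
            settings[name.strip()] = value.strip()
    return settings


def effective_settings(defaults, config_text):
    """Start from the defaults, then let the config file override them."""
    merged = dict(defaults)
    merged.update(parse_config(config_text))
    return merged


# Placeholder default, for illustration only; check the documentation.
defaults = {"exclude_urls": "/cgi-bin/ .cgi"}

config_text = """
# a config file that never mentions exclude_urls
start_url: http://www.DOMAIN.com/
"""

settings = effective_settings(defaults, config_text)
print(settings["exclude_urls"])  # still set, though absent from the file
```

The point is only that an attribute missing from the file still ends up
with a value, so an exclude list can be in effect without ever being
written down in htdig.conf.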

> I perused htdig.org and the faq, and perhaps I missed something,

http://www.htdig.org/attrs.html

> or perhaps
> its fixed in a different version, or more likely, is just something I don't
> have a clue about :-)

Well, in this case it's not fixed in the latest version, because it's not
a bug. However, there are some important bug fixes in 3.1.5, including
a patch for a major security hole.

See:
http://www.htdig.org/RELEASE.html
http://www.htdig.org/ChangeLog

and for some fixes and enhancements since 3.1.5's release:

http://www.htdig.org/FAQ.html#q2.5

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



