Re: [htdig] Infinite loop problem with htdig


Subject: Re: [htdig] Infinite loop problem with htdig
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Fri Jun 09 2000 - 13:49:24 PDT


I can help with the first problem...

According to Joe Baker:
> Problem 1) htdig seems to be getting into an infinite loop, bouncing back and
> forth between directories, creating links like the following,
>
> 12555:121140:3:http://www.amnestyusa.org/countries/colombia/index.html/actions/reports/blueprint/reports/blueprint/reports/blueprint/reports/blueprint/reports/blueprint/reports/blueprint/reports/blueprint/reports/blueprint/senate12221999.html:
>
> The actual directories are /home/aiusa/public_html/countries/colombia/actions
> and /home/aiusa/public_html/countries/colombia/actions
>
> htdig just runs until it exceeds quota.
>
> I don't understand how the index.html gets buried in the link. I saw a similar
> problem with the older version

This happens when a document has an improper link to an SSI document, i.e.
an href with a trailing slash after the .html suffix (if XBitHack is
enabled) or the .shtml suffix. For normal, non-server-parsed HTML files,
this trailing slash would be invalid, but for SSI and CGI pages - which
can be programmed to interpret the extra path information - it's allowed.

However, very few SSI documents actually make use of this. A link that
adds the extra slash causes problems because the page URL then looks like
a directory URL to any web client (browser or spider). Relative URLs in
the page get resolved beneath that bogus directory, and since the server
ignores the extra path information, they lead right back to the same page
rather than to the intended one.
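
To illustrate why the crawl never ends, here is a minimal Python sketch
of how a client resolves relative links. It uses the standard library's
urljoin, not htdig's own code, and the host www.example.org and the link
text are made up for the illustration.

```python
from urllib.parse import urljoin

# Hypothetical URLs for illustration only.
good_base = "http://www.example.org/countries/colombia/index.html"
bad_base = good_base + "/"   # the improper link with the extra slash


def resolve_chain(base, relative_links):
    """Resolve each relative link against the result of the previous one,
    the way a spider does when every fetch returns the same page."""
    urls = []
    current = base
    for link in relative_links:
        current = urljoin(current, link)
        urls.append(current)
    return urls


# With the correct URL, "index.html" is dropped and the link is a sibling.
print(urljoin(good_base, "actions/"))
# With the trailing slash, "index.html/" is treated as a directory.
print(urljoin(bad_base, "actions/"))

# The server ignores the extra path and serves the same page again, so the
# same relative link is appended once more on every pass.
for url in resolve_chain(bad_base, ["reports/blueprint/"] * 3):
    print(url)
```

Each URL in the last loop is new to the spider, so a list of already
visited URLs does not stop it.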

You can keep htdig from tripping up on these by adding .html/, .htm/ and
.shtml/ to the exclude_urls attribute in htdig.conf.

E.g.:

exclude_urls: /cgi-bin/ .cgi .html/ .htm/ .shtml/
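
As a rough check of what that line will reject, the sketch below assumes
that exclude_urls does plain substring matching against the whole URL
(which is how the attribute is documented, as far as I know). The function
name is_excluded and the example.org URLs are made up for the
illustration; this is not htdig's code.

```python
# Patterns copied from the exclude_urls line above.
EXCLUDE_PATTERNS = "/cgi-bin/ .cgi .html/ .htm/ .shtml/".split()


def is_excluded(url, patterns=EXCLUDE_PATTERNS):
    """Return True if any pattern occurs anywhere in the URL
    (assumed substring semantics)."""
    return any(pattern in url for pattern in patterns)


# The runaway URL is rejected because it contains ".html/".
print(is_excluded("http://www.example.org/countries/colombia/index.html/actions/"))
# The page itself, without the trailing slash, is still indexed.
print(is_excluded("http://www.example.org/countries/colombia/index.html"))
```

Note that under this assumption any legitimate URL that really does use
extra path information after .html would be skipped as well.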

It might still be a good idea to find and fix the offending link(s)
to avoid problems with other spiders, to ensure the SSI pages aren't
ignored altogether, and to avoid confusing users with a page whose links
will all seem to point right back to itself.
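
One possible way to hunt for the offending links is sketched below in
Python, using only the standard library. The names HrefCollector and
suspect_links and the sample HTML are hypothetical, and the pattern is
only a heuristic: it flags any href with a slash right after .html, .htm
or .shtml, so check each hit by hand.

```python
import re
from html.parser import HTMLParser

# Matches ".html/", ".htm/" and ".shtml/" anywhere in an href.
SUSPECT = re.compile(r"\.s?html?/")


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def suspect_links(html_text):
    """Return the hrefs in html_text that have a slash after the suffix."""
    parser = HrefCollector()
    parser.feed(html_text)
    return [href for href in parser.hrefs if SUSPECT.search(href)]


sample = (
    '<a href="index.html/">Colombia</a> '
    '<a href="actions/">Actions</a> '
    '<a href="news.shtml/extra">News</a>'
)
print(suspect_links(sample))
```

Running suspect_links over the text of each file under the document root
should narrow down which pages contain the bad hrefs.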

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



