Gilles Detillieux (email@example.com)
Mon, 5 Jul 1999 11:43:12 -0500 (CDT)
According to Jim Cole:
> As far as I can tell, the meltdown originates with a small syntax error
> in one user's page. This user had a link that looks like...
> <a href="../../index.html/">
> This then resolved to a new, unique URL of
> http://www.########.org/index.html/ and htdig went ahead and processed
> it as such. When relative links were found in the index.html file, new
> URLs were generated, such as
> When htdig did the GET on this specific URL, the server of course
> returned index.html instead of qpost.html, but treated relative links in
> index.html as if they were relative to
> http://www.########.org/index.html/queries, which generated URLs like
> http://www.########.org/index.html/queries/fyi/tngnfind.html. This
> process continued, generating longer and longer bogus URLs. Not sure
> what finally broke the cycle.
> I am in the process of trying to crawl the site again with .html/ and
> .htm/ added to the exclude_urls attribute. On the off chance that this
> doesn't work, does anyone have other ideas about how to avoid this
> problem? Well, short of validating thousands of pages contributed by
> dozens of people? ;)
Validating the pages, or dealing with these problems as you find them,
are really your only options. The exclude_urls fix should prevent the
runaway digging problem you had, but there's no guarantee that other
broken URLs won't cause some other problem. Part of the problem you
had is that your HTTP server is pretty lenient about what it accepts as a
valid URL. My Apache server gives a 404 error if I add a trailing slash
to an index.html file. Your server's leniency just made it harder to find
the error. You found it because your htdig ran amok. Other visitors
to your site may have found that if they follow the broken link back
to your home page, then none of the links on that page work right.
They may have just gotten fed up and left.
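To make the runaway concrete, here is a rough Python sketch of the resolution steps described above, followed by a substring check in the spirit of the exclude_urls fix. It is only an illustration, not htdig's actual code. The host www.example.org, the starting page path, and the relative link queries/qpost.html are assumptions made for the example; the masked host and the exact links in the real index.html are not known.

```python
from urllib.parse import urljoin

# Hypothetical page containing the bad link; the real path is not known.
page = "http://www.example.org/users/someone/page.html"

# The malformed href, with its trailing slash, resolves to a "directory".
bogus = urljoin(page, "../../index.html/")

# The lenient server returns index.html for every URL below, so each
# relative link in it is resolved against an ever-deeper bogus base.
# "queries/qpost.html" is an assumed link; "fyi/tngnfind.html" is taken
# from the URL quoted in the report.
step1 = urljoin(bogus, "queries/qpost.html")
step2 = urljoin(step1, "fyi/tngnfind.html")
step3 = urljoin(step2, "queries/qpost.html")

# Substring patterns mirroring the proposed exclude_urls additions.
EXCLUDE = (".html/", ".htm/")


def is_excluded(url, patterns=EXCLUDE):
    """Return True if the URL contains any of the excluded substrings."""
    return any(p in url for p in patterns)


for url in (bogus, step1, step2, step3):
    print(is_excluded(url), url)
```

Every URL in the chain still contains ".html/", so a substring exclusion of that kind rejects the first bogus URL and everything that would have grown out of it, while a normal URL ending in index.html is untouched.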
There's only so much one can do to make a spider more idiot-proof.
People are just too resourceful at making idiotic mistakes, so it's
impossible to anticipate them all. :-)
-- 
Gilles R. Detillieux              E-mail: <firstname.lastname@example.org>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930