Subject: Re: [htdig3-dev] symlink bug
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Fri Jul 28 2000 - 09:33:45 PDT
According to Geoff Hutchison:
> On Fri, 28 Jul 2000, Jonathan Bartlett wrote:
> > I once wrote a spider program that ran into the same problem. The way I
> > fixed it there was to have an option of the maximum URL size. This should
> > prevent such a loop. The default could be infinite, or just a really huge
> > number.
>
> Nah, max_hop_count is IMHO a more elegant way of doing it. Who knows why
> you might want to have some very long URL, but there's probably no reason
> to be descending beyond some number of hops from your top page.
>
> Of course a duplicate detection scheme (i.e. checksum the pages) would be
> nice, but it doesn't look like that's going to happen unless someone
> volunteers to do it soon.
I don't know that a checksum would catch this problem, if the "page"
being repeatedly indexed is a dynamically generated directory listing.
Apache would just keep lengthening the path, and that path shows up in
the title and h1 header of the page, so the checksum would be different
each time. I think exclude_urls is the best attribute for dealing with
this particular problem.
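
To make the point concrete, here is a rough Python sketch (not ht://Dig code) of why a checksum misses this loop while a URL pattern catches it. The render_listing() helper, the /docs/loop/ layout, and the "/loop/" pattern are made up for illustration, and the substring test is only my assumption about how an exclude_urls-style filter would be applied.

```python
import hashlib
from typing import List


def render_listing(path: str) -> str:
    """Hypothetical stand-in for an auto-generated directory index page.

    The requested path is echoed in both the title and the h1 header,
    which is the behaviour described above.
    """
    return (
        f"<html><head><title>Index of {path}</title></head>"
        f"<body><h1>Index of {path}</h1>"
        '<a href="loop/">loop/</a></body></html>'
    )


def checksum(page: str) -> str:
    """Digest of the full page text, as a naive duplicate detector would use."""
    return hashlib.md5(page.encode("utf-8")).hexdigest()


def is_excluded(url: str, patterns: List[str]) -> bool:
    """Reject a URL if it contains any pattern (assumed substring match)."""
    return any(pattern in url for pattern in patterns)


# A symlink named "loop" that points back at its parent directory makes every
# deeper path serve the same directory contents under a longer name.
paths = ["/docs/" + "loop/" * depth for depth in range(4)]

# Each listing embeds its own path, so every digest is different and a
# checksum-based duplicate check would treat all of them as new pages.
digests = [checksum(render_listing(p)) for p in paths]
distinct_digests = len(set(digests))

# Filtering on the URL itself stops the descent at the first looping path.
exclude_patterns = ["/loop/"]
crawled = [p for p in paths if not is_excluded(p, exclude_patterns)]

print(distinct_digests, crawled)
```

A checksum taken over only the list of directory entries might still work, but that means knowing the page is a directory listing in the first place, which is why filtering on the URL seems simpler here.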
--
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev-unsubscribe@htdig.org
You will receive a message to confirm this.