Re: htdig: Duplicate files with unique URLs


Keith D. Tyler (ktyler@law.harvard.edu)
Thu, 11 Dec 1997 09:52:54 -0500 (EST)


> > The trouble is that somewhere in all the web pages there is a URL that
> > sends HTDIG back to the same site with a slight variation in the URL
> > that makes it unique. So the same site gets indexed twice.
> I have exactly the same problem. My solution was to add "///" to
> the exclude_urls field of the conf file. You have to accept some
> duplicates (http://foo/bar/ and http://foo/bar//) but at least htdig
> doesn't run forever.
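
(For the record, that workaround is a single line in the conf file; a
minimal sketch, assuming the usual attribute: value syntax of the htdig
configuration file:

    exclude_urls: ///

Any URL containing "///" as a substring is then skipped by the digger,
though double-slash duplicates still slip through, as noted above.)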

Has anyone else noticed the engine storing both "..dirname" and
"..dirname/" in the database? Or is that an old discussion?

Somewhere, the engine needs to be told to filter URLs against regexps
like '/+' and '/*$'. Repeated instances of the same page shouldn't
happen.

Something like '(http://)?\w*(/+\w+)*/*$' is what I'm thinking of: a
matching function that joins each \w group back together, converting
'/+' to '/' and stripping the trailing '/*$'. (Okay, okay, not all
servers are as forgiving about the latter as Apache is.)
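
To make that concrete, here is a minimal sketch of such a
canonicalisation in C++, htdig's own language. The function name
normalize_url is hypothetical, it only special-cases the http://
scheme, and it is not how htdig actually does it:

    #include <iostream>
    #include <string>

    // Collapse runs of '/' in the path to a single '/' (the '/+'
    // case) and strip trailing slashes (the '/*$' case), so that
    // http://foo/bar, http://foo/bar/ and http://foo//bar// all
    // reduce to the same database key.
    std::string normalize_url(const std::string &url)
    {
        const std::string scheme = "http://";
        std::string::size_type start = 0;
        if (url.compare(0, scheme.size(), scheme) == 0)
            start = scheme.size();   // leave the "//" in the scheme alone

        std::string out = url.substr(0, start);
        bool prev_slash = false;
        for (std::string::size_type i = start; i < url.size(); ++i) {
            if (url[i] == '/') {
                if (!prev_slash)
                    out += '/';      // keep the first slash of a run
                prev_slash = true;
            } else {
                out += url[i];
                prev_slash = false;
            }
        }
        while (out.size() > start && out[out.size() - 1] == '/')
            out.erase(out.size() - 1);   // strip '/*$'
        return out;
    }

    int main()
    {
        std::cout << normalize_url("http://foo/bar//") << "\n"; // http://foo/bar
        std::cout << normalize_url("http://foo//bar/") << "\n"; // http://foo/bar
        return 0;
    }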

Making exclusion rules for these quirks isn't a good way to do this. What
happens if some of your pages are ONLY referenced by these quirky URLs?

Kdt


