Re: [htdig] Duplicate pages


Subject: Re: [htdig] Duplicate pages
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Sep 20 2000 - 08:52:13 PDT


According to anr@att.net:
> The site I am indexing is a bit peculiar. The following
> is an example of the setup, where each page is exactly
> the same.
>
> www.domain.com/subdirectory/
> www.domain.com/subdirectory/index.html
> www.domain.com/Subdirectory/
> www.domain.com/Subdirectory/index.html
>
> I assumed that when a URL has no index.html, the server
> was just loading index.html anyway. Here's the
> problem. htdig recognizes these as 4 different pages,
> and indexes all of them. I can see why it would treat
> them as 2 different pages because of the s and S. Is there
> any way to prevent the duplicates?

The remove_default_doc attribute should take care of the superfluous
"index.html" entries, but I'm not so sure about the extra Subdirectory
names. You can't use exclude_urls for this, because it does a
case-insensitive match, so excluding one spelling would exclude the
other as well.
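
To illustrate what stripping the default document does to the four URLs
above, here is a rough Python sketch. It is not htdig's code; the
function name strip_default_doc and the DEFAULT_DOCS tuple are made up
for the example, and the http:// scheme is assumed. It shows that the
"index.html" duplicates collapse, while the s/S pair remains.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of default document names, standing in for
# whatever remove_default_doc is set to in the htdig configuration.
DEFAULT_DOCS = ("index.html",)


def strip_default_doc(url, default_docs=DEFAULT_DOCS):
    """Drop a trailing default document name from the URL path."""
    parts = urlsplit(url)
    # Split the path into everything before the last "/" and the last segment.
    head, _, tail = parts.path.rpartition("/")
    if tail in default_docs:
        parts = parts._replace(path=head + "/")
    return urlunsplit(parts)


urls = [
    "http://www.domain.com/subdirectory/",
    "http://www.domain.com/subdirectory/index.html",
    "http://www.domain.com/Subdirectory/",
    "http://www.domain.com/Subdirectory/index.html",
]

# Two URLs are left: the comparison is case-sensitive, so the
# "subdirectory" and "Subdirectory" spellings stay distinct.
unique = sorted({strip_default_doc(u) for u in urls})
for u in unique:
    print(u)
```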

On my site, I make use of a few symbolic links for subdirectories, to
give an all-lowercase equivalent to some mixed-case names, but I never
use these in URLs on my site, for this very reason. I only use them to
support links from other sites, where other admins may be a tad sloppy
about getting the case right. I realise this isn't a workable alternative
for you if you don't maintain control over the whole site you're indexing.
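
For anyone who does control the document tree, the symbolic-link setup
described above could be created along these lines. This is only a
sketch, assuming a case-sensitive filesystem that supports symlinks; the
helper name add_lowercase_alias is invented for the example.

```python
import os


def add_lowercase_alias(parent, name):
    """Create an all-lowercase symlink next to a mixed-case directory.

    Returns the path of the new link, or None if no link was created
    (the name is already lowercase, or the alias already exists).
    """
    lower = name.lower()
    if lower == name:
        return None
    alias = os.path.join(parent, lower)
    if os.path.lexists(alias):
        return None
    # Relative target, so the link keeps working if the tree is moved.
    os.symlink(name, alias)
    return alias
```

Calling add_lowercase_alias(docroot, "Subdirectory"), where docroot is
whatever directory holds the mixed-case name, would add a "subdirectory"
link pointing at it. As noted above, the site's own pages should keep
linking to only one of the two spellings, or the indexer will see both.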

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



