Re: Fw: [htdig] multiple search results


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 27 Oct 1999 09:33:31 -0500 (CDT)


According to Doug McCallum:
> I was wondering if someone could provide us with some direction as to why,
> when htdig results are returned, there are multiple duplicates of the
> same file?

According to Torsten Neuer:
> Possible reasons (which are all HTTP server related) include:
>
> - The server is not case-sensitive with regard to URLs; some
> hyperlinks to the same document are written differently.
> See http://www.htdig.org/attrs.html#case_sensitive

As far as I know, the case_sensitive option still only affects matching of
names in robots.txt, and not the check for whether a URL has already been
visited. Someone was supposed to be working on that, but I don't think any
results of that work ever made it back into the source trees. Inconsistent
letter cases in links to a given URL are still a problem.
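
One way to spot this kind of inconsistency outside of htdig is to scan a
list of URLs for entries that differ only in letter case. The following is
a minimal sketch, not part of htdig; the function name and the example.com
URLs in the usage note are hypothetical, and it only makes sense if the
server really does treat paths case-insensitively.

```python
from collections import defaultdict


def case_collisions(urls):
    """Group URLs that are identical except for letter case.

    Returns a dict mapping the lowercased URL to the sorted list of
    distinct spellings seen, for URLs with more than one spelling.
    """
    groups = defaultdict(set)
    for url in urls:
        groups[url.lower()].add(url)
    # Keep only URLs that were seen with more than one spelling.
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }
```

Feeding it the URLs from a dig (or from a link checker) gives a list of
links to make consistent, for example
case_collisions(["http://example.com/Docs/a.html",
"http://example.com/docs/a.html"]).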

> - The server has multiple names (which are not different virtual
> hosts), causing documents to appear once for every server name.
> See http://www.htdig.org/attrs.html#server_aliases

This is probably THE most common cause of duplicates, based on what I've
read on this list. Sometimes the difference is subtle, like a trailing
"." at the end of a domain name. Luckily, the fix is fairly easy if
you're not dealing with a huge number of servers. (I have one server
with 2 aliases.)
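
Within htdig, the server_aliases attribute linked above is the place to
handle this. Purely to illustrate the idea of mapping every alias to one
canonical name, here is a small sketch that could be used in a separate
link-checking script. The alias table and the example.com host names are
hypothetical, and the function is an assumption of mine, not htdig code.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical alias table: each name on the left is the same server
# as the canonical name on the right.
ALIASES = {
    "example.com": "www.example.com",
    "web1.example.com": "www.example.com",
}


def canonical_url(url, aliases=ALIASES):
    """Rewrite the host part of a URL to its canonical server name.

    The host is lowercased, a trailing "." is dropped, and known
    aliases are replaced. Any user:password part is discarded.
    """
    parts = urlsplit(url)
    # urlsplit().hostname is already lowercased; strip the trailing dot.
    host = (parts.hostname or "").rstrip(".")
    host = aliases.get(host, host)
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )
```

Two URLs that produce the same canonical_url() result are candidates for
the same document appearing under two server names.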

> - The documents are retrieved using GET with a session id as a
> URL parameter. In order to fix this, you will have to
> postprocess the result of the htsearch query with a wrapper script.

Not a common problem, which is a good thing because it's really not easy
to fix. A while back, I proposed a couple of ideas for URL modifications
in htdig or htsearch, but nothing came of it. I guess the folks dealing
with these problems figured the solution wasn't worth the effort.
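
For anyone who does want to try the wrapper-script route, the core of it
might look something like the sketch below: strip the session parameter
from each result URL, then drop results that collapse to the same URL.
This is only an illustration. The parameter names "sessionid" and "sid"
are assumptions (use whatever your application really calls it), and the
parsing of htsearch's output into a list of URLs is left out entirely.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session parameter names; adjust for the real application.
SESSION_PARAMS = {"sessionid", "sid"}


def strip_session(url, names=SESSION_PARAMS):
    """Return the URL with any session-id query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in names
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def dedupe(urls):
    """Strip session ids and keep the first occurrence of each URL."""
    seen = set()
    unique = []
    for url in urls:
        key = strip_session(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

Note that this only hides the duplicates at search time; the database
still holds one copy of the document per session id that was indexed.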

> - You use symbolic links, causing the same document to be served
> under different names. In order to get around this problem, you
> will probably need to exclude the URL from the dig.
> See http://www.htdig.org/attrs.html#exclude_urls

On my server, I sometimes use symbolic links to preserve old URLs, so
that outside sites that do deep linking to my site won't suddenly have
link rot. However, I'm always careful to avoid using old URLs within
my site. Just as with the case-insensitive server problem and the
host alias problem, consistency in URLs is the key to avoiding duplicates.
(Easier said than done, I know.)
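
If you are not sure which symbolic links exist under your document root,
a short script can list them, which gives you candidates for exclude_urls
or for links to clean up. This is a rough sketch under the assumption that
URLs map directly onto paths below a single document root; the function
name is made up for the example.

```python
import os


def find_symlinks(doc_root):
    """Return sorted paths, relative to doc_root, of all symbolic links."""
    links = []
    # os.walk does not descend into symlinked directories by default,
    # but it still lists them in dirnames, so they are checked here too.
    for dirpath, dirnames, filenames in os.walk(doc_root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                links.append(os.path.relpath(full, doc_root))
    return sorted(links)
```

Each path returned corresponds to a URL that serves the same content as
some other URL, so it should either be excluded from the dig or never be
linked to from within the site.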

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
