Subject: Re: [htdig] The same page appears several times
From: Gilles Detillieux (firstname.lastname@example.org)
Date: Wed Aug 02 2000 - 09:17:21 PDT
According to Stephane ALLAIRE:
> Is it possible to limit the number of times the same page appears in
> the search results?
> I order my results by title, and often the same page appears several times...
> It's not a big problem, but it's not great to see the same document 3 or 4
> times in the same list... I've had a look at the parameters of ht-dig but I've
> found nothing, or... I've missed the right one!
htdig/htsearch do not have duplicate document detection, so the onus is
on you to figure out how to avoid them. The first step is to figure out
why htdig comes across the same document multiple times using different
URLs. htdig does keep track of visited URLs, so when you have duplicate
documents in the database, they always have different URLs (sometimes
only subtly different).
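One way to see which URLs are involved is to fetch the suspect pages yourself and group them by a hash of their content. The following is only a rough sketch of that idea, not something htdig does: the function name group_duplicates and the example URLs are made up for illustration, and it only catches byte-for-byte identical pages.

```python
import hashlib
from collections import defaultdict


def group_duplicates(pages):
    """Group URLs whose bodies are byte-for-byte identical.

    `pages` is an iterable of (url, body_bytes) pairs, for example the
    pages you fetched for the URLs that show up repeatedly in results.
    Returns a list of URL groups, each group sorted alphabetically.
    """
    by_digest = defaultdict(list)
    for url, body in pages:
        by_digest[hashlib.sha256(body).hexdigest()].append(url)
    # Keep only the digests that are shared by more than one URL.
    return [sorted(urls) for urls in by_digest.values() if len(urls) > 1]


if __name__ == "__main__":
    # Hypothetical sample data: two host names serving the same page.
    sample = [
        ("http://www.example.com/a.html", b"<html>same</html>"),
        ("http://example.com/a.html", b"<html>same</html>"),
        ("http://www.example.com/b.html", b"<html>other</html>"),
    ]
    for group in group_duplicates(sample):
        print(" == ".join(group))
```

Comparing the URLs within each group usually makes the cause obvious: a different host name, an alternate path, or a different query string.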
Common causes of this are:
1) different host names referring to the same server, or to duplicate sets
of pages on different servers. Use server_aliases to correct this.
2) symbolic links giving alternate paths to the same documents. You need
to build a list of these, to feed into htdig's exclude_urls attribute,
so it won't get at the pages using those URLs. Another way is to use
redirects instead of symbolic links. htdig will follow the redirects to
the actual page URLs, and realise they've already been indexed. A third
alternative, but a more tedious one, is to change links in your documents
so that any page is always referenced using the same URL, to avoid htdig
picking up references to alternative paths.
3) dynamic pages with different CGI parameters yielding the same content.
This is similar to (2) above, but may be more difficult to deal with
unless you can fix your CGI scripts to always use consistent URLs.
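For cause (2), the list of symbolic links does not have to be built by hand. Here is a minimal sketch that walks a document root and prints the links as URL paths on an exclude_urls line. It assumes URL paths map directly onto the directory tree under one document root, and the helper names symlink_url_paths and exclude_urls_line are made up for this example. Keep in mind that exclude_urls patterns are matched as substrings of the URL, so check that no entry accidentally matches pages you want indexed, and merge the result with any patterns you already have rather than replacing them.

```python
import os


def symlink_url_paths(doc_root):
    """Return the URL paths (relative to doc_root) of all symbolic links."""
    found = []
    # os.walk does not descend into linked directories by default
    # (followlinks=False), so each link is reported exactly once.
    for dirpath, dirnames, filenames in os.walk(doc_root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                rel = os.path.relpath(full, doc_root)
                found.append("/" + rel.replace(os.sep, "/"))
    return sorted(found)


def exclude_urls_line(paths):
    """Format the paths as one space-separated exclude_urls setting.

    Paths containing spaces would need separate handling, since the
    attribute value is a space-separated list.
    """
    return "exclude_urls: " + " ".join(paths)


if __name__ == "__main__":
    import sys

    # Usage: python find_links.py /path/to/document/root
    print(exclude_urls_line(symlink_url_paths(sys.argv[1])))
```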
-- 
Gilles R. Detillieux              E-mail: <email@example.com>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930