[htdig] Avoiding multiple (identical) search results


Ivan Trundle (ivan.trundle@alia.org.au)
Thu, 18 Mar 1999 11:27:01 +1100


I've been following the threads on indexing only HTML files, since I have a similar problem. I tried the solution Geoff offered earlier [limit_urls_to: html / and limit_normalized: ${start_url}], but it doesn't work for me: every search still lists each matching document four times. Have I overlooked something? Here is what I get:

http://www.alia.org.au/
http://www.alia.org.au/home.html
http://www.alia.org.au/alia/
http://www.alia.org.au/alia/home.html
(all leading to the same document)

Two issues arise. First, our Apache 1.3.4 server is configured to interpret requests for http://www.alia.org.au/ as either ../index.html or ../home.html. How can I stop ht://dig from indexing both the directory URL and its default document as separate results?

The other issue is related, and I suspect both stem from a misconfigured htdig.conf.

Our server has web documents stored at /usr/local/www/alia/, but visitors should only see files from /alia/ inwards (historical reasons, and to allow virtual servers alongside in other directories). The URL of http://www.alia.org.au/alia/xyz.html is technically possible to serve up, but in reality the .../alia/... component is not required, and http://www.alia.org.au/xyz.html is preferred. Can I somehow configure ht://dig to only offer the one result? Or is this beyond the scope of ht://dig?
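
To make the desired outcome concrete, here is a minimal Python sketch of the normalization I am after. It is not ht://dig code or configuration: the names DEFAULT_DOCS, REDUNDANT_PREFIX and canonical_url are made up for illustration, and it assumes index.html and home.html are the only default documents on the server.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for illustration only; these are not ht://dig settings.
DEFAULT_DOCS = ("index.html", "home.html")  # files Apache serves for a directory request
REDUNDANT_PREFIX = "/alia/"                 # path component that should not appear in results


def canonical_url(url):
    """Return the preferred form of a URL on this site."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Drop the redundant leading /alia/ component.
    if path.startswith(REDUNDANT_PREFIX):
        path = "/" + path[len(REDUNDANT_PREFIX):]
    # Treat a default document as the directory that contains it.
    head, _, tail = path.rpartition("/")
    if tail in DEFAULT_DOCS:
        path = head + "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


urls = [
    "http://www.alia.org.au/",
    "http://www.alia.org.au/home.html",
    "http://www.alia.org.au/alia/",
    "http://www.alia.org.au/alia/home.html",
]
# Collapse the list to its distinct canonical forms.
unique = sorted({canonical_url(u) for u in urls})
print(unique)
```

With this rule the four URLs listed earlier collapse to a single entry, and http://www.alia.org.au/alia/xyz.html becomes http://www.alia.org.au/xyz.html.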

As a matter of interest, I've configured start_url: to http://www.alia.org.au/

Thanks in advance, Ivan



This archive was generated by hypermail 2.0b3 on Fri Mar 19 1999 - 17:32:54 PST