htdig: Duplicate files with unique URLs


Ray Krebs (vrkrebs@infi.net)
Wed, 10 Dec 1997 13:51:48 -0500 (EST)


Hello Everyone,

I'm running HTDIG and indexing 8 different sites.

The trouble is that somewhere in the web pages there is a link that sends
HTDIG back to the same site with a slight variation in the URL that makes
it unique, so the same site gets indexed twice.

Example URLs:

http://www.sitename1.com/document.html

http://www.sitename1.com//document.html
                         ^

As you can see, the only difference is the extra "/" slash. HTDIG hits this,
sees it as unique, and ends up starting a whole new crawl of the site,
doubling the content of one of the 8 sites. Searches against the index
return duplicates as well.
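
To make the problem concrete: the two URLs above would compare equal if
runs of slashes in the path were collapsed before comparison. Below is a
rough Python sketch of that idea. It is not anything HTDIG provides; the
function names collapse_slashes and unique_urls are made up for
illustration, and it assumes that a doubled slash in the path never changes
which document the server returns.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(url):
    """Return url with runs of '/' in the path collapsed to a single '/'.

    Hypothetical helper for illustration; not part of HTDIG.
    """
    parts = urlsplit(url)
    # Only the path is rewritten; the '//' after the scheme is not part of it.
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


def unique_urls(urls):
    """Keep the first URL seen for each normalized form, preserving order."""
    seen = set()
    result = []
    for url in urls:
        key = collapse_slashes(url)
        if key not in seen:
            seen.add(key)
            result.append(url)
    return result


if __name__ == "__main__":
    crawl_queue = [
        "http://www.sitename1.com/document.html",
        "http://www.sitename1.com//document.html",
    ]
    for url in unique_urls(crawl_queue):
        print(url)
```

With the two example URLs, only the first one survives, which is the
behaviour I am hoping to get from the indexer.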

My config file setup:
--------------------

start_url: http://www.sitename1.com http://www.sitename2.com ...
http://www.sitename8.com

limit_urls_to: ${start_url}

Is there something that I can use to 'filter' out all of these extras?

I have tried setting exclude_urls to www.sitename2.com//, but this doesn't
seem to work.

Are there other ways to exclude this?

Thanks for any assistance,
Ray

---------------------------------------------------------------
V. Ray Krebs III InfiNet Publishing Systems Administrator
vrkrebs@infi.net Phone: 757-624-2295 x3310 Fax: 757-627-2498
---------------------------------------------------------------



