Ray Krebs (vrkrebs@infi.net)
Wed, 10 Dec 1997 13:51:48 -0500 (EST)
Hello Everyone,
I'm running HTDIG and indexing 8 different sites.
The trouble is that somewhere in all the web pages there is a URL that
sends HTDIG back to the same site with a slight variation in the URL that makes
it unique. So the same site gets indexed twice.
Example URLs:
http://www.sitename1.com/document.html
http://www.sitename1.com//document.html
                         ^
As you can see, the only difference is the extra "/" slash. HTDIG hits this,
sees it as unique, and ends up starting a whole new search, doubling the
content of one of the 8 sites. Searches against this index are also returning
duplicates.
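To make the duplication concrete, here is a minimal Python sketch of the kind
of normalization that would treat the two example URLs as one. It is only an
illustration, not something HTDIG does or offers as far as I know; the
function name collapse_slashes is hypothetical, and it assumes that repeated
slashes in the path never carry meaning on these sites.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(url):
    """Collapse runs of '/' in the path of url into a single '/'.

    Hypothetical helper for illustration; the scheme, host, query and
    fragment are passed through unchanged.
    """
    parts = urlsplit(url)
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


# The two example URLs from this message.
urls = [
    "http://www.sitename1.com/document.html",
    "http://www.sitename1.com//document.html",
]

# After normalization both URLs map to the same string, so a set keeps one.
unique = {collapse_slashes(u) for u in urls}
print(unique)
```

With the two example URLs above, the set ends up holding a single entry,
which is the behaviour I am hoping to get from the indexer.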
My config file setup:
--------------------
start_url: http://www.sitename1.com http://www.sitename2.com ...
http://www.sitename8.com
limit_urls_to: ${start_url}
Is there something that I can use to 'filter' out all of these extras?
I have tried setting exclude_urls to www.sitename2.com// but this doesn't
seem to work.
Are there other ways to exclude this?
Thanks for any assistance,
Ray
---------------------------------------------------------------
V. Ray Krebs III InfiNet Publishing Systems Administrator
vrkrebs@infi.net Phone: 757-624-2295 x3310 Fax: 757-627-2498
---------------------------------------------------------------