htdig: htdig URL duplication


Danny Birchall (D.P.Birchall@sussex.ac.uk)
Mon, 25 Jan 1999 11:04:07 +0000 (GMT)


For historical reasons, our server configuration includes a number of
server aliases that cut out part of the URL to make it shorter; for
example,
www.sussex.ac.uk/Units/foo/
becomes
www.sussex.ac.uk/foo/.
This causes a problem when running htdig because, inevitably, somewhere
within our document tree pages are referenced both as /Units/foo/
and as /foo/. Result: ht://Dig indexes the same pages twice, with
different URLs, and when htsearch is run, each page is returned
twice, once with each URL.
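
To make the aliasing concrete, here is a rough Python sketch. It is not
part of ht://Dig; the alias table and the name canonical_path are
hypothetical, and only the /Units/foo/ and /foo/ prefixes come from our
setup. It shows that the two URL forms look distinct until the alias is
expanded.

```python
# Hypothetical alias table: short prefix -> full prefix in the document tree.
ALIASES = {"/foo/": "/Units/foo/"}

def canonical_path(path: str) -> str:
    """Expand an aliased prefix so both URL forms map to one path."""
    for short, full in ALIASES.items():
        if path.startswith(short):
            return full + path[len(short):]
    return path

urls = ["/Units/foo/index.html", "/foo/index.html"]

# Compared as plain strings, the two URLs are different documents.
distinct_as_seen = set(urls)

# After expanding the alias, they collapse to a single path.
distinct_canonical = {canonical_path(u) for u in urls}
```

Here distinct_as_seen holds two entries while distinct_canonical holds
one, which is the duplication we see in the search results.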

At first we managed to get round this problem by using the local_urls
attribute in the config file, together with a patch which modified htdig
to note the inode of each file being indexed and so prevent the same
physical file being indexed a second time. This eliminated duplicates
by ensuring that each file on the filesystem was indexed only once.
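
The inode check can be sketched roughly as follows. This is an
illustration in Python under my own assumptions, not the actual patch to
htdig: the function name dedupe_by_inode is hypothetical, and it assumes
the aliased paths resolve to the same file on a POSIX filesystem (for
example through a symlinked directory).

```python
import os

def dedupe_by_inode(paths):
    """Return the paths whose underlying file has not been seen before."""
    seen = set()      # (device, inode) pairs already accepted
    unique = []
    for path in paths:
        st = os.stat(path)              # follows symlinks to the real file
        key = (st.st_dev, st.st_ino)    # identifies one physical file
        if key in seen:
            continue                    # same file reached via another path
        seen.add(key)
        unique.append(path)
    return unique
```

Given the /Units/foo/ and /foo/ paths to one file, only the first path
encountered is kept. The device number is paired with the inode because
inode numbers are only unique within a single filesystem.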

The problem is that this only works for each complete new run of htdig.
When we run an update, all the duplications reappear.

Can anybody think of a workaround, however elaborate? This problem is
all that's stopping us from using htdig on our site.

Thanks

--------------------------------------------------
Danny Birchall
Editor
University of Sussex Information Service
http://www.sussex.ac.uk/

D.P.Birchall@sussex.ac.uk
Tel: (0)1273 678745
Fax: (0)1273 678441
---------------------------------------------------
