[htdig] Does htmerge remove URL from database ?

Subject: [htdig] Does htmerge remove URL from database ?
From: Olivier Korn (olivier.korn@enseignant.org)
Date: Wed Nov 22 2000 - 09:29:14 PST


We have been using ht://Dig for many months now with nothing to complain
about, but there is something strange that I don't understand.

Here is how we use ht://Dig:

1. We have 20 or so web sites named, say, http://www.site1.fr/a-path/,
http://www.site2.fr/a-path-which-does-not-read-the-same-as-site1/, and so
on. Some are hosted on MS-IIS, others on Linux/Apache.

2. For each of these sites, I wrote a config file (site1.conf, site2.conf,
and so on) containing start_url, the URL restrictions, and so on. Each of
these .conf files includes a shared file named "_commun_include". Of
course, I set a different database prefix for each site.

3. Once a week, htdig is run for each site: "htdig -i -c site1.conf",
then "htdig -i -c site2.conf", and so on.

4. After all the sites have been htdigged, I run htmerge in sequence to
merge all the small databases into one. The first call is
"htmerge -c site1.conf"; subsequent calls are
"htmerge -c site1.conf -m site2.conf",
"htmerge -c site1.conf -m site3.conf", and so on.

5. Everything seems to work perfectly. Using htsearch, I can find documents
that are on any of the sites. Note for later (because of the example I
give below) that my locale is set correctly, so I don't have any problem
with accents. I also use the accents patch, which works fine, and
"htfuzzy accents" is run after all the htmerge calls.
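
For illustration, steps 3 to 5 could be scripted roughly as follows. This
is only a sketch: the function name weekly_commands is hypothetical, the
config file names are the placeholders used above, and passing "-c" to
htfuzzy is an assumption (the message only says that "htfuzzy accents" is
run). The function prints the commands rather than running them, so the
sequence can be checked before piping it to sh.

```shell
#!/bin/sh
# Print the weekly indexing commands for a list of config files.
# The first config is the merge target; every other config is merged into it.
weekly_commands() {
    target=$1
    # Step 3: full re-index (-i) of every site.
    for conf in "$@"; do
        echo "htdig -i -c $conf"
    done
    # Step 4: run htmerge on the target alone, then merge the others into it.
    echo "htmerge -c $target"
    shift
    for conf in "$@"; do
        echo "htmerge -c $target -m $conf"
    done
    # Step 5: rebuild the accents fuzzy database after all merges.
    # Assumption: htfuzzy is pointed at the merged database with -c.
    echo "htfuzzy -c $target accents"
}

weekly_commands site1.conf site2.conf site3.conf
```

Note that in this sequence, as in the description above, only site1.conf
is passed to htmerge on its own; the other configs appear only after -m.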

Here is the problem:

1. On site5, there is an HTML document titled "Rénovation du BTS tourisme".
When I search the merged database for "rénovation tourisme" (method=and),
the document is not found (ht://Dig even says there is no document
containing these words). Adding the
"restrict=http://www.site5.fr/site5-path-to-docs/" parameter doesn't change
anything (this is no surprise, but I wanted to be sure).

2. Now for the amazing part of my story. If I run "htmerge -c site5.conf"
(notice there is no -m this time) and then run htsearch -c site5.conf with
"rénovation tourisme", my document is found! In other words, the document
was indexed, but it was apparently dropped when its database was merged
with another one.
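
One way to narrow this down is to run the same query against the
per-site config and the merged config and compare the output. The sketch
below only prints the two htsearch command lines. It assumes htsearch
accepts a CGI-style query string as a command-line argument, that spaces
are encoded as "+", and that accented characters can be passed through
as-is in the current locale; search_commands is a hypothetical helper
name.

```shell
#!/bin/sh
# Print htsearch command lines that run one query against several configs,
# so the per-site and merged results can be compared side by side.
search_commands() {
    # Encode spaces as '+'; other characters are passed through unchanged.
    words=$(printf '%s' "$1" | tr ' ' '+')
    shift
    for conf in "$@"; do
        echo "htsearch -c $conf 'words=$words&method=and'"
    done
}

# site5.conf is the per-site database, site1.conf the merged one.
search_commands "rénovation tourisme" site5.conf site1.conf
```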

I'd like to know whether anybody has already run into this particular
problem, or whether it is a "feature" of htmerge (deleting entries when
merging two databases together). What can I do about it?

I'm really confused about all of this (and that state of mind doesn't help
me write correct English. Sorry about that.)

Olivier Korn
Strasbourg, France.

