Re: htdig: Possible htsearch bug


Didier Gautheron (dgautheron@magic.fr)
Wed, 25 Nov 1998 22:25:20 +0000


George Adams wrote:
>
>
> >db.words.db should be generated from scratch from db.wordlist by htmerge.
> >I'm assuming the word is actually in db.wordlist?
>
> No, actually that's not the case.
>
> Here is the state after indexing the site while the file containing the keyword "dalek" DOES exist:
>
> % ls -l
> -rw-rw-r-- 1 adams users 83968 Nov 25 10:20 db.docdb
> -rw-rw-r-- 1 adams users 6144 Nov 25 10:20 db.docs.index
> -rw-rw-r-- 1 adams users 109895 Nov 25 10:20 db.wordlist
> -rw-rw-r-- 1 adams users 117760 Nov 25 10:20 db.words.db
>
> % grep -l "dalek" *
> db.docdb
> db.wordlist
> db.words.db
>
> Now I remove the file containing the word "dalek" and reindex the site by running "rundig".
>
> % ls -l
> -rw-rw-r-- 1 adams users 83968 Nov 25 10:21 db.docdb
> -rw-rw-r-- 1 adams users 6144 Nov 25 10:21 db.docs.index
> -rw-rw-r-- 1 adams users 109740 Nov 25 10:21 db.wordlist
> -rw-rw-r-- 1 adams users 117760 Nov 25 10:21 db.words.db
>
> % grep -l "dalek" *
> db.words.db

Yes: in htsearch/words.cc, mergeWords() only opens or creates db.words.db,
so words that have been deleted are never removed, whatever the
remove_bad_urls setting is. As a matter of fact, remove_bad_urls isn't
involved here at all. For example:

htdig -i on one URL, index.html, containing foo (one occurrence)
remove foo from index.html
htdig
htmerge

foo isn't removed, and it's not a bad URL!
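
George's grep check can be wrapped into a small shell function to spot
such ghost words. This is only a sketch: the function name is made up
here, it assumes both files sit in one directory under the names shown
in the listing above, and it relies on grep finding the plain word inside
the binary db.words.db, as it did for George. Like his grep, it matches
substrings, not whole words.

```shell
# ghost_word WORD [DBDIR]
# Succeeds (exit 0) when WORD is still present in db.words.db but no
# longer in db.wordlist, i.e. the word database kept a stale entry.
# Hypothetical helper; the file names are taken from the listing above.
ghost_word() {
    word=$1
    dbdir=${2:-.}
    # -a treats the binary database as text, -q only sets the exit status
    grep -aq -- "$word" "$dbdir/db.words.db" || return 1
    if grep -aq -- "$word" "$dbdir/db.wordlist"; then
        # still in the word list, so the hit is legitimate
        return 1
    fi
    return 0
}
```

For example, "ghost_word dalek /path/to/db && echo stale" would report the
situation George describes after his reindex.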

This puzzled me in a nasty way. I was playing with the locale setting
and was plagued by htsearch finding both réseaux and seaux (the latter
was created without locale:fr). I got ghost hits without changing any
documents!

I think it's a bug in htdig-3.0.8b2 too.

A dirty hack:
unlink db.words.db first, or use the -a option!
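
The first variant of the hack could look like the sketch below. The
function name and the HTMERGE variable are invented for this example, and
it assumes that db.words.db sits in the given directory and that htmerge
run without options picks up the right configuration.

```shell
# rebuild_words [DBDIR]
# Remove the old db.words.db so that htmerge has to create it from
# scratch instead of adding to the existing one.
# Hypothetical wrapper: HTMERGE may be set to the path of the htmerge
# binary; it defaults to "htmerge" found via PATH.
rebuild_words() {
    dbdir=${1:-.}
    rm -f -- "$dbdir/db.words.db" || return 1
    "${HTMERGE:-htmerge}"
}
```

Run htdig first, then call "rebuild_words /path/to/db" in place of a bare
htmerge.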

Didier




This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:28:53 PST