loic@ceic.com
Tue, 5 Oct 1999 19:01:43 +0200 (MEST)
Geoff Hutchison writes:
>
> Actually, these are bugs in the Retriever code. Much of it still
> assumes there's a separate wordlist file that gets merged into a
> word db, so these things happen. But that's my point--it's a *bug*,
> not a requirement to run htmerge.
I agree.
> One of the reasons for redesigning the databases was to allow
> searches on databases being updated. Granted, this means the
> databases may have data that needs to be purged, but searches should
> still work. The only advantage of running htmerge would be to purge
> words from deleted documents. (This is one of my reasons for wanting
> to make an httools directory, but that will have to come later.)
>
> Does this make sense?
Yes. I did not find an easy and efficient way to fix the bug related to
inserting words for documents not yet visited. The easy solution with
the current WordList code would be to call WordList::WalkDelete with the
document id whenever a document is visited and reported as 'not found'.
Unfortunately this becomes very inefficient as the database grows, because
the word database is indexed primarily on the word, not the document id,
so the entries for one document are scattered across the whole index.
There is no way to make that fast unless you want to double the size of
the index.
The search procedure should discard references to documents that are
not found or otherwise invalid, but apparently this is not done at present
(I have not investigated this part, however).
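To make the cost argument concrete, here is a minimal sketch. It is not the ht://Dig WordList code: WordIndex, delete_document and search are hypothetical names, and a std::map keyed on (word, document id) stands in for the word database. It only illustrates why deleting by document id has to walk every entry, while a search for one word touches a single contiguous range and can cheaply skip invalid documents.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the word database: an ordered map keyed on
// (word, document id), so all entries for one word are adjacent.
using DocId = int;
using WordKey = std::pair<std::string, DocId>;
using WordIndex = std::map<WordKey, int>;  // value: illustrative occurrence count

// Remove every entry belonging to one document. The key starts with the
// word, so a document's entries are scattered and the whole index is walked.
// Returns the number of entries examined.
std::size_t delete_document(WordIndex& index, DocId doc) {
    std::size_t examined = 0;
    for (auto it = index.begin(); it != index.end();) {
        ++examined;
        if (it->first.second == doc) {
            it = index.erase(it);  // erase returns the next valid iterator
        } else {
            ++it;
        }
    }
    return examined;
}

// Look up one word, discarding references to documents that are not in
// valid_docs. Only the contiguous range of entries for `word` is visited.
std::vector<DocId> search(const WordIndex& index, const std::string& word,
                          const std::set<DocId>& valid_docs) {
    std::vector<DocId> hits;
    const WordKey start(word, std::numeric_limits<DocId>::min());
    for (auto it = index.lower_bound(start);
         it != index.end() && it->first.first == word; ++it) {
        if (valid_docs.count(it->first.second) != 0) {
            hits.push_back(it->first.second);
        }
    }
    return hits;
}
```

In this sketch delete_document examines every entry regardless of how few belong to the document, which is the inefficiency described above. Filtering in search adds only one membership test per candidate, so stale entries can stay in the index until a later purge.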
-- 
Loic Dachary
ECILA
100 av. du Gal Leclerc
93500 Pantin - France
Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61
e-mail: Loic@Dachary.org
URL: http://www.senga.org/