Subject: Re: [htdig3-dev] htdig Czech edition
From: Geoff Hutchison (ghutchis@wso.williams.edu)
Date: Tue Mar 07 2000 - 06:16:41 PST
At 8:48 AM +0100 3/7/00, Martin Povolny wrote:
>For some languages, indexing over word roots makes much better sense than
>over whole words; this is absolutely true for Czech.
>So we have experimented with lemma (commercial) and ajka (almost finished GNU)
>lemmatization software to get word roots. Finally we took out part of ispell
>(access to the hash) and used this because it can also be used with other
>languages (but it knows far fewer word forms than the other two).
We'd be interested to see how you've done this. As for ispell, I
don't know offhand how you write the affix files, but it's definitely
possible to add more word forms to it. I know the German ispell files
are quite complete.
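To make the idea of indexing over word roots concrete, here is a rough
sketch in Python of looking up a root by stripping suffixes and checking
the result against a list of known roots. It is only an illustration: the
ROOTS set, the SUFFIX_RULES table and the find_root() function are made-up
names, the rules are toy English ones, and none of this reflects the real
ispell hash or affix file format.

```python
# Toy root lookup by suffix stripping.
# ROOTS and SUFFIX_RULES are hypothetical stand-ins for a real dictionary
# and affix table; they are NOT the ispell data format.
ROOTS = {"walk", "index", "try"}

# (suffix to strip, text to put back in its place), tried in this order
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("es", ""), ("s", "")]


def find_root(word):
    """Return a known root for `word`, or the lowercased word itself."""
    word = word.lower()
    if word in ROOTS:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            candidate = word[:-len(suffix)] + replacement
            # Accept a stripped form only if the dictionary knows it.
            if candidate in ROOTS:
                return candidate
    # No rule matched: fall back to indexing the word unchanged.
    return word
```

An indexer would then store find_root(word) instead of the word itself, so
that "walking" and "walked" both land under "walk". Words the dictionary
does not know are simply indexed as they are, which is why the number of
word forms covered by the affix data matters.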
>At present we're trying to index our faculty's web, but it seems that
>the algorithm htdig uses for creation of the inverted file is too naive --
>seems to me like it's trying to apply unix 'sort' on a 1GB file...
That is what it's doing. If you're concerned about it, I'd switch to
the 3.2 code, which builds the inverted index on-the-fly during
indexing.
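For anyone unfamiliar with the difference, here is a small Python sketch
contrasting the two approaches. It is not the htdig code from either
version; the function names and the in-memory dictionary are assumptions
made for illustration, and a real indexer would keep the index in an
on-disk database rather than in memory.

```python
from collections import defaultdict


def build_index_by_sorting(docs):
    """Batch approach: collect every (word, doc_id) pair, sort, then group.

    This mimics writing a big word list and running 'sort' over it.
    """
    pairs = sorted(
        {(word, doc_id) for doc_id, text in docs for word in text.lower().split()}
    )
    index = {}
    for word, doc_id in pairs:
        index.setdefault(word, []).append(doc_id)
    return index


def build_index_incrementally(docs):
    """On-the-fly approach: update each word's posting set as documents arrive."""
    index = defaultdict(set)
    for doc_id, text in docs:
        for word in text.lower().split():
            index[word].add(doc_id)
    # Sort each posting list only so the result is comparable with the above.
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical sample documents as (doc_id, text) pairs
docs = [(1, "search engine index"), (2, "index the web"), (3, "web search")]
assert build_index_by_sorting(docs) == build_index_incrementally(docs)
```

Both functions produce the same index. The difference is that the first
needs the complete list of pairs before it can do anything, so the sort
grows with the whole collection, while the second never holds more than
the index itself plus the current document.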
Cheers,
-Geoff
This archive was generated by hypermail 2b28 : Tue Mar 07 2000 - 06:22:46 PST