Re: htdig: htfuzzy - endings runs VERY long


Andrew Scherpbier (andrew@contigo.com)
Thu, 26 Nov 1998 10:11:58 -0800


Alexander Bergolth wrote:
>
> Hello Frank!
>
> On Thu, 26 Nov 1998, Frank Richter wrote:
>
> > I'm building a database for "ending" search algorithm with a German
> > dictionary and rule set. The dictionary has 40794 lines.
>
> Hehe! Have much fun!
>
> > I started running (from 3.1.0b2)
> > % htfuzzy -v -c htfuzzy-de.conf endings
> > yesterday. The first 20000 lines it did in a few minutes, but after ca.
> > 14 hours it is here:
> > htfuzzy/endings: words: 27900
> >
> > I saw the same with 3.0.8b2, using a smaller dictionary (25000 lines), so
> > this is probably not a new problem.
>
> I didn't debug htfuzzy but I ran htfuzzy with a 76087 words input-file and
> it took about 3 weeks or so on a brand new RS/6000 dual processor machine
> to build the dictionary. (The first 50000 words took a few minutes, then
> it slowed down dramatically.)
>
> I don't know if the db-files are binary compatible but you can have my
> -rw-rw-r-- 1 bergolth edvz 7310336 Aug 21 07:43 root2word.db
> -rw-rw-r-- 1 bergolth edvz 13724672 Aug 21 07:43 word2root.db
> files and try it with them...
>
> http://strike.wu-wien.ac.at/~leo/htdig/root2word.db
> http://strike.wu-wien.ac.at/~leo/htdig/word2root.db
>
> Bye,
> Leo

This is actually a pretty serious problem. Can someone run 'truss' or
'strace' or whatever on htfuzzy while it is generating the endings databases
to see if it does anything really inefficient?

There are two factors that make the endings database generation slow:
1) htfuzzy issues *a lot* of queries against the underlying database
   (hence disk I/O)
2) the regular expressions used to parse the .aff rules

In the past, to solve #1 above, I have explicitly generated the endings db
files on a RAM drive (/tmp under Solaris). This helped. I have not had any
extreme slowness on any of the Linux machines (without RAM drives) that I've
been using for ht://Dig, though.
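
To make the RAM drive idea concrete, here is a rough sketch in Python of
"build in a scratch directory, then copy the results into place". Everything
in it is an assumption for illustration: build_in_scratch and its build
callback are hypothetical names, not part of ht://Dig, and whether the
scratch directory is actually RAM-backed depends on what you pass as
scratch_root (for example /tmp under Solaris).

```python
import shutil
import tempfile
from pathlib import Path


def build_in_scratch(build, dest_dir, scratch_root=None):
    """Run build(scratch_path) in a temporary directory, then copy the
    files it produced into dest_dir.

    scratch_root is where the temporary directory is created. Point it at
    a RAM-backed filesystem to keep the heavy database I/O off the disk;
    None falls back to the system default temp directory.
    Hypothetical helper, not an ht://Dig tool.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        scratch_path = Path(scratch)
        build(scratch_path)  # all expensive reads and writes happen here
        for produced in sorted(scratch_path.iterdir()):
            if produced.is_file():
                shutil.copy2(produced, dest / produced.name)
                copied.append(produced.name)
    return copied
```

The build callback would be whatever actually runs the endings generation
with its output paths pointed at the scratch directory; that part is left
out on purpose.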

If the databases are not platform neutral (I know the GDBM files weren't;
I don't know about db2), maybe we should have a repository of the ASCII
format of those files and have a little tool that converts those to the
database files.
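
A minimal sketch of what such a converter could look like, in Python. The
ASCII layout is an assumption (one record per line, key and value separated
by a tab), and the standard-library dbm.dumb module is only a stand-in for
the real database backend; ascii_to_db is a hypothetical name. It shows the
shape of the tool, not the actual ht://Dig file formats.

```python
import dbm.dumb


def ascii_to_db(ascii_path, db_path, encoding="utf-8"):
    """Load tab-separated "key<TAB>value" lines into a fresh database.

    Assumed input format, not the real ht://Dig dump format. Blank lines
    and lines without a tab are skipped. Returns the number of records
    written (a repeated key overwrites the earlier value but is still
    counted).
    """
    written = 0
    with open(ascii_path, encoding=encoding) as src, \
            dbm.dumb.open(db_path, "n") as db:  # "n": always start empty
        for line in src:
            line = line.rstrip("\n")
            if "\t" not in line:
                continue
            key, value = line.split("\t", 1)
            db[key.encode(encoding)] = value.encode(encoding)
            written += 1
    return written
```

Since the ASCII file is the platform-neutral part, each site would run the
converter locally and get database files in its own native format.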

Thoughts?

-- 
Andrew Scherpbier <andrew@contigo.com>
Contigo Software <http://www.contigo.com/>