Re: [htdig] spell check - python wrapper script


Subject: Re: [htdig] spell check - python wrapper script
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Mon Oct 30 2000 - 14:32:14 PST


According to Geoff Hutchison:
> At 12:44 PM -0400 10/27/00, GregHolmes@aol.com wrote:
> >In case anyone might find this useful, I have attached a python wrapper
> >script that uses ispell to suggest alternatives to search words that may be
> >typos.
>
> This has been one of my thoughts for an excellent
> language-independent Fuzzy class for ht://Dig. Of course as the
> "dictionary," you'd actually use the wordlist itself. This would have
> the dual advantages that you'd have any words not in normal
> dictionaries and the algorithm could also offer up words misspelled
> in the pages themselves. (Perish the thouhgt! [sic])
>
> Of course this idea also got lost in the shuffle. Anyone interested
> in working on this sort of thing (as you have in a sense) would be
> doing us all a great favor.
>
> Thanks for the script!

Yes, neat script! Adapting it for Unix is pretty simple. In addition to
changing the paths, which are pretty obvious, you should use os.popen()
instead of win32pipe.popen(). A slight bug is that for some words, ispell
can suggest two words separated by a space, and the script doesn't
change that space to a "+" in the query string. That's a simple addition.
I also needed to define my own replace() function, as my Python 1.4
didn't include replace() in its string library.
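
For anyone who wants to see the space-to-"+" fix spelled out, here is a
minimal sketch. It is not the attached script: it assumes a current Python 3
(so subprocess stands in for os.popen(), and quote_plus() stands in for a
hand-written replace()), and it assumes ispell is run as "ispell -a", whose
near-miss lines start with "&". The function names are made up for
illustration.

```python
import subprocess
from urllib.parse import quote_plus


def parse_suggestions(ispell_line):
    """Return the suggestions found on one line of `ispell -a` output.

    Only '&' (near-miss) lines carry suggestions, in the form
    '& word count offset: first, second, ...'. Other lines give [].
    """
    if not ispell_line.startswith("&") or ":" not in ispell_line:
        return []
    _, _, misses = ispell_line.partition(":")
    return [m.strip() for m in misses.split(",") if m.strip()]


def to_query_term(suggestion):
    """Encode a suggestion for a query string; a space becomes '+'."""
    return quote_plus(suggestion)


def suggest(word, ispell_cmd=("ispell", "-a")):
    """Ask ispell about one word (assumes ispell is on the PATH)."""
    result = subprocess.run(list(ispell_cmd), input=word + "\n",
                            capture_output=True, text=True, check=True)
    terms = []
    for line in result.stdout.splitlines():
        terms.extend(to_query_term(s) for s in parse_suggestions(line))
    return terms
```

The parsing and encoding are kept separate from the ispell call so they can
be checked without ispell installed, e.g. to_query_term("tho ugh") gives
"tho+ugh".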

You can implement Geoff's suggestion of building the ispell dictionary
for this script from the wordlist by adding these lines to rundig:

cd $DBDIR
sed -n 's/^\([a-z][a-z]*\) .*/\1/p' db.wordlist | munchlist > wrdlst.0
buildhash -s wrdlst.0 $COMMONDIR/english.aff wrdlst.hash
rm wrdlst.0*

(N.B.: That's a tab character before the .* in sed's regular expression.)
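
If the literal tab is a nuisance to type or gets lost in mail, the sed step
can be sketched in Python instead. This assumes only what the sed expression
itself assumes: each db.wordlist line starts with the word, followed by a
tab and then other fields. The function name and the sample fields in the
usage note are invented for illustration.

```python
import re

# Same test as the sed expression: the line must start with one or more
# lowercase ASCII letters, immediately followed by a tab.
WORD_LINE = re.compile(r"^([a-z]+)\t")


def extract_words(lines):
    """Yield the leading all-lowercase word of each matching line."""
    for line in lines:
        match = WORD_LINE.match(line)
        if match:
            yield match.group(1)
```

For example, list(extract_words(["apple\tx", "Zebra\tx", "foo2\tx"])) keeps
only "apple"; words with capitals or digits are skipped, as with sed. The
resulting words, one per line, would still need to go through munchlist and
buildhash as above.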

It would be really great if the "speling" fuzzy match algorithm were
set up to use ispell the way the script does to get the alternate words.
Of course, in 3.2 you don't have a db.wordlist file, so you'd need an
"htfuzzy speling" command to traverse the word database and feed it
to munchlist, then to buildhash.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930




This archive was generated by hypermail 2b28 : Mon Oct 30 2000 - 14:38:29 PST