Subject: RE: [htdig] how to defind word
From: NEPOTE Charles (Neuilly Gestion) (email@example.com)
Date: Mon Jul 10 2000 - 00:36:03 PDT
I found something which might interest you.
See CTTeX : General-Purpose Thai word segmentation program
Sources are available at :
A binary version is also available for Linux Mandrake (it may also work
on Red Hat); it will soon be available on a Mandrake mirror.
See: http://www.linux-mandrake.com/en/cookerdevel.php3
Here is the description for the Linux Mandrake RPM :
Name        : cttex                          Relocations: (not relocateable)
Version     : 1.21                           Vendor: MandrakeSoft
Release     : 1mdk                           Build Date: Thu Jun 29 04:55:09
Install date: (not installed)                Build Host:
Group       : System/Internationalization    Source RPM: (none)
Size        : 442255                         License: Distributable
Packager    : Pablo Saratxaga <firstname.lastname@example.org>
URL         : http://thaigate.nacsis.ac.jp/files/index.html
Summary     : Cttex, Thai word separator program
The main part of Cttex is A Thai Word Separator algorithm using
a dictionary. A wrapper for formatting Thai LaTeX document file is provided
to demonstrate the use of this word-sep routine. The program can also
be used as a simple word-sep filter.
* Wed Jun 28 2000 Pablo Saratxaga <email@example.com> 1.21-1mdk
- first rpm version for Mandrake
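
To illustrate what a dictionary-based word separator does, here is a
minimal sketch in Python. It is not Cttex's actual algorithm (the package
description only says that it uses a dictionary); it assumes a simple
greedy longest-match strategy. The names segment and DICTIONARY and the
word list itself are hypothetical, and the input is the English stand-in
string from the question quoted below.

```python
def segment(text, dictionary):
    """Split unspaced text into words by greedy longest match.

    Characters that do not start any dictionary word are emitted
    one at a time, so no input is ever dropped.
    """
    max_len = max(map(len, dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                words.append(candidate)
                i += size
                break
        else:
            # No dictionary word starts here: emit a single character.
            words.append(text[i])
            i += 1
    return words


# Hypothetical word list for the English stand-in example.
DICTIONARY = {"this", "is", "terst1", "test2", "sister"}

tokens = segment("thisisterst1", DICTIONARY)

# A raw substring search finds "sister" inside the unspaced text,
# whereas a lookup on the segmented words does not.
substring_hit = "sister" in "thisisterst1"
token_hit = "sister" in tokens
```

Once the text is split into words before indexing, a search for "sister"
is compared against whole words rather than against the raw string.
Greedy longest matching can still pick the wrong split when several
dictionary words overlap, so treat this only as a starting point.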
> -----Original Message-----
> From: Prisda Gomutputra [mailto:firstname.lastname@example.org]
> Sent: Saturday, 24 June 2000 19:51
> To: email@example.com
> Subject: [htdig] how to defind word
> I am currently trying to fine-tune ht://dig so that it works more
> accurately with the Thai (8-bit) language. I can get it to work, but
> the search results are not very relevant, because the Thai language
> does not use spaces to separate words. Spaces are only used to
> separate sentences.
> For example, the English sentence "this is terst1. this is test2"
> would be written in Thai as follows: "thisisterst1. thisistest2"
> 1) Is there a way to tell ht://dig how to identify the words and
> index them properly?
> 2) When the words are combined together without spaces in between, a
> new problem is introduced. Take "thiSISTERst1" from the example above:
> when a user searches for the word "sister", "thiSISTERst1" will be
> returned too. Is there a way to prevent this problem from happening?
> Highly appreciated