[htdig] how to defind word


Subject: [htdig] how to defind word
From: NEPOTE Charles (Neuilly Gestion) (charles.nepote@cetelem.fr)
Date: Mon Jun 26 2000 - 01:24:21 PDT


This is quite an interesting problem, and it shows that "occidental" cultural
conventions are not universal all over the world. (See my answers at the
end.)

According to Prisda Gomutputra :

> I am currently trying to fine-tune ht://Dig to work with the Thai (8-bit)
> language more accurately. I can get it to work fine, but the search is not
> very accurate, since the Thai language does not use spaces to separate
> words. Spaces are only used to separate sentences.
>
> For example, a sentence in English, "this is teRst1. this is test2", would
> be written in Thai as follows: "thisisteRst1. thisistest2"
> ^^^^
> 1) Is there a way to tell ht://Dig to identify the words and index them
> properly?
> 2) When the words are combined together without spaces in between, it
> introduces a new problem, as in the example above, "thiSISTERst1". When a
> user searches for the word "sister", "thiSISTERst1" will be returned too.
> Is there a way to prevent this problem from happening?

How can you tell the difference between "thiSISTERst1" and "thisisTERST2"?
Is it the overall sense of the sentence that allows you to decide how to
understand "thisisterst1"?
Are there ambiguous sentences (where it is difficult to decide the sense
of "thisisterst")?
Is there a way to clearly distinguish "thiSISTERst1" from
"thisisTERST2"?

I think one solution is to insert (manually or automatically) an "invisible"
space between "this", "is" and "terst". I mean a character which won't be
shown when you read, but which will be understood by software (such as
ht://Dig) as a separation between two words. (Also think about HTML- or
SGML-like markup, for example:
<word>this</word><word>is</word><word>terst</word>.)
-- Manually: it may take a long time, and it is difficult to change the
usual way of writing.
-- Automatically: you may use or build software that analyses every
sentence to add "invisible" spaces between words. I don't know if such
software exists.
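
To illustrate the automatic approach, here is a minimal sketch of a
dictionary-based, greedy longest-match segmenter. Everything in it is an
assumption made for illustration: the word list LEXICON is hypothetical
(a real Thai lexicon would be needed), the function names are invented, and
the zero-width space (U+200B) is only one possible "invisible" separator.
Whether ht://Dig treats that character as a word boundary is not something
I have verified.

```python
# Hypothetical word list; a real Thai lexicon would be needed in practice.
LEXICON = {"this", "is", "terst1", "test2", "sister"}

# Zero-width space: not shown when the text is displayed (an assumed choice).
ZWSP = "\u200b"


def segment(text, lexicon):
    """Split text into words using greedy longest-match from left to right.

    At each position the longest lexicon entry that matches is taken.
    Characters that start no known word become one-character tokens.
    """
    max_len = max(len(word) for word in lexicon)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            # No lexicon entry starts here: emit a single character.
            words.append(text[i])
            i += 1
    return words


def add_invisible_spaces(text, lexicon):
    """Join the segmented words with the invisible separator."""
    return ZWSP.join(segment(text, lexicon))


print(segment("thisisterst1", LEXICON))
```

With this word list, "thisisterst1" is split into "this", "is" and
"terst1", so "sister" is never produced as a word. Greedy matching is only
a heuristic, though: in genuinely ambiguous sentences it can pick the wrong
split, which is exactly the difficulty raised in the questions above.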

Another theoretical solution, less elegant but immediately possible, is to
use synonyms in ht://Dig:
"thiSISTERst" should have "this", "is" and "terst" as synonyms.

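As a rough sketch of the synonym idea, the helper below builds one line of a
plain-text synonym list: the run-together form followed by its component
words. The function name is invented, and the "one group of space-separated
words per line" layout is an assumption; the exact file format that ht://Dig
expects should be checked in its documentation.

```python
def synonym_line(compound, parts):
    """Return one text line: the run-together form, then its component words.

    The layout (space-separated words on one line) is an assumption, not a
    confirmed ht://Dig format.
    """
    return " ".join([compound] + list(parts))


print(synonym_line("thisisterst", ["this", "is", "terst"]))
```

Each run-together form would need its own line, so this only scales if the
list of forms can be generated automatically.
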
I hope you will find a solution.
Charles Népote.
Paris, France.

> Highly appreciated
> Prisda


