htdig patch (kind of) for ISO_8859_2


Iosif Fettich (ifettich@netsoft.ro)
Sat, 31 Jan 1998 21:21:10 +0200 (EET)


Hello,

first of all: thanks for htDig! I think it's a really nice tool.
We started to use it on one of the sites we are hosting,
www.pcconcrete.ro.
Here is the patch I applied (we needed it), along with some comments.
-------------------------------------------------
diff SGMLEntities.cc SGMLEntities.cc.old
161,162d160
< unsigned char x;
<
174,185c172
< //PATCH to make romanian ISO_8859_2 chars fit into plain ASCII//
< x = atoi (entity + 1);
< if (x == 227 || x == 226) return 'a';
< if (x == 195 || x == 194) return 'A';
< if (x == 238) return 'i';
< if (x == 206) return 'I';
< if (x == 186) return 's';
< if (x == 170) return 'S';
< if (x == 254) return 't';
< if (x == 222) return 'T';
< //END OF PATCH
< return x;
---
>       return atoi(entity + 1);

---------------------------------------
I made it 'quick and dirty'; the correct way to solve the underlying problem will be established by you, I hope ;)

Here is the story of what the patch solves.

In Romanian, there are three kinds of 'a':
- plain a
- a circumflex
- a breve

They are pronounced differently, but when a proper representation isn't possible (ASCII email, for instance) they are usually defaulted to 'a'. As we wish to display our HTML pages with the correct characters (use of ISO-8859-2 is now usually possible), we build our HTML pages using the correct representations.

When a visitor is using the search engine, however, there is no guarantee that he will be able to type an ISO-8859-2 character on his keyboard. That means that he often won't be able to type a correct search string.

We decided to be correct when displaying information, but to default to plain ASCII when indexing the documents. Visitors whose search strings would contain these specific characters can easily be told to use plain ASCII instead.

So a search for 'Romania' will find documents containing either Romania or Rom&#226;nia - that's what we wish. The found documents will show up the way they are, Romania or Rom&#226;nia, highlighting each of these words. That's handy and rather nice.

A search for Rom&#226;nia will find nothing - that's not quite elegant, but far better than not finding Rom&#226;nia when searching for Romania.
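
To make the indexing behaviour described above concrete, here is a minimal standalone sketch. It is not htDig code: the function names foldRomanian and foldWord are hypothetical, and the only thing taken from the patch is the list of ISO-8859-2 character codes. It shows how both spellings end up under the same ASCII index key.

```cpp
#include <string>

// Hypothetical helper (not part of htDig): fold one ISO-8859-2 byte
// to its plain-ASCII base letter, using the same codes as the patch.
unsigned char foldRomanian(unsigned char c)
{
    switch (c) {
        case 227: case 226: return 'a';  // a breve, a circumflex
        case 195: case 194: return 'A';  // A breve, A circumflex
        case 238: return 'i';            // i circumflex
        case 206: return 'I';            // I circumflex
        case 186: return 's';            // s cedilla
        case 170: return 'S';            // S cedilla
        case 254: return 't';            // t cedilla
        case 222: return 'T';            // T cedilla
        default:  return c;              // everything else is unchanged
    }
}

// Hypothetical helper: build the ASCII key under which a word is indexed.
std::string foldWord(const std::string &word)
{
    std::string key;
    key.reserve(word.size());
    for (std::string::size_type i = 0; i < word.size(); ++i)
        key += static_cast<char>(foldRomanian(static_cast<unsigned char>(word[i])));
    return key;
}
```

With this sketch, foldWord("Romania") and foldWord("Rom\xE2nia") both give "Romania", so a plain-ASCII query matches either spelling. A query typed with the circumflex finds nothing only as long as the query string itself is not folded the same way.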

Hope you understand what I tried to explain. Would be nice if you'd provide a way to configure this kind of stuff too, in the next release.
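
One possible shape for such a configurable mapping, purely as a sketch: the "code:letter" spec format and the name buildFoldTable are assumptions made for illustration, not an existing htDig attribute or function. The idea is to read the pairs from a configuration string into a 256-entry translation table instead of hard-coding them.

```cpp
#include <cctype>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical: build a 256-entry translation table from a spec such as
// "227:a 226:a 238:i". Codes not mentioned in the spec map to themselves.
std::vector<unsigned char> buildFoldTable(const std::string &spec)
{
    std::vector<unsigned char> table(256);
    for (int i = 0; i < 256; ++i)
        table[i] = static_cast<unsigned char>(i);

    std::istringstream in(spec);
    std::string item;
    while (in >> item) {
        std::string::size_type colon = item.find(':');
        // Skip malformed items: no colon, nothing after it, or no leading digit.
        if (colon == std::string::npos || colon + 1 >= item.size())
            continue;
        if (!std::isdigit(static_cast<unsigned char>(item[0])))
            continue;
        int code = std::atoi(item.substr(0, colon).c_str());
        if (code < 0 || code > 255)
            continue;                    // outside the single-byte range
        table[code] = static_cast<unsigned char>(item[colon + 1]);
    }
    return table;
}
```

A site would then list only the characters it cares about, and the entity decoder would return table[x] instead of going through a chain of if statements.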

Another note: maximum_pages isn't documented. No big deal, but maybe you care about that too.

We are using htdig-3.0.8b1 on a Linux box.

Thanks again,

Iosif Fettich

-----------------------------------------------------------------------
Iosif Fettich   | e-mail: ifettich@netsoft.ro   ICQ UIN: 5496730
Mng. Director   | phone/fax: +40-(0)65-162614
NetSoft SRL     | mail: NetSoft SRL,4300 Tg.Mures,O.P.1-C.P.172,Romania


