Some more testing gave the following results:

The German flower name "Stiefmütterchen" and the Icelandic
"þrenningarfjóla" are treated differently in meta content
than in the body or title part of an HTML document.

When they are in the body or in the title, the "ü", "þ" and "ó"
are each decoded to a one-byte character in the .wordlist and .words.db files.

In meta content, however, these words end up as "stiefmuuml;t"
and "thorn;rennin" in the .wordlist and .words.db files. That is, the "&" is
removed and the rest of the entity is kept as letters ("&" is in
valid_punctuation but ";" is not, by default).

Shouldn't they be decoded the same way as in the title and body?

Lennart Almvist
Museum of Natural History, Stockholm

