Torsten Neuer (email@example.com)
Fri, 30 Jul 1999 19:55:44 +0200
According to Gilles Detillieux:
>According to Lennart Almkvist:
>> Some more testing gave the following results:
>> The German flower name "Stiefmütterchen" and the Icelandic
>> "þrenningarfjóla" are treated differently in meta content
>> than in the body or title part of an HTML document.
>> When in the body or in the title, the "ü", "þ" and "ó"
>> are decoded to one-byte characters in the .wordlist and .words.db files.
>> In meta content, however, these words are decoded to "stiefmuuml;t"
>> and "thorn;rennin" in the .wordlist and .words.db files. That is, the "&" is
>> removed and the rest is kept as letters ("&" is in valid_punctuation but
>> the ";" is not, by default).
>> Shouldn't they be decoded the same way as in the title or body?
>Here's a patch for 3.1.2 that should do what you want. Please give it a
>try and let us know if it fixes this bug.
Something else is going wrong now.
It seems that the patch also strips off one character after the
entity somewhere (not everywhere, but in most cases).
E.g. instead of "über" I get "üer".
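
To illustrate the symptom, here is a minimal sketch of how such an
off-by-one could arise. It is not the actual ht://Dig code or the patch:
the function decode_entities, the table kEntities and the skip_extra flag
are all hypothetical, and the table only covers the entities mentioned in
this thread, mapped to Latin-1 bytes. The assumption is a hand-written
decoder that, after replacing "&name;", advances once more past the ";".

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical entity table: only the entities from this thread,
// mapped to their single-byte Latin-1 values.
static const std::map<std::string, char> kEntities = {
    {"uuml", '\xFC'},    // ü
    {"thorn", '\xFE'},   // þ
    {"oacute", '\xF3'},  // ó
};

// Decode "&name;" sequences into single bytes. Unknown or unterminated
// entities are copied through unchanged. With skip_extra set, the loop
// reproduces the suspected bug by advancing one position too far.
std::string decode_entities(const std::string& in, bool skip_extra = false) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == '&') {
            std::size_t semi = in.find(';', i + 1);
            if (semi != std::string::npos) {
                auto it = kEntities.find(in.substr(i + 1, semi - i - 1));
                if (it != kEntities.end()) {
                    out += it->second;
                    i = semi + 1;         // resume right after the ';'
                    if (skip_extra) ++i;  // bug: swallows the next character
                    continue;
                }
            }
        }
        out += in[i++];
    }
    return out;
}
```

With skip_extra left at false, "&uuml;ber" decodes to the four bytes of
"über" in Latin-1. With skip_extra set to true the "b" is lost, which
matches the "üer" I am seeing.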
--
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14          Tel: +49-4101-403605
D-25474 Ellerbek          Fax: +49-4101-403606
E-Mail: firstname.lastname@example.org
Internet: http://www.inwise.de