[htdig] Problems with parse_doc.pl and German Umlaute

Subject: [htdig] Problems with parse_doc.pl and German Umlaute
From: thch@techem.de
Date: Wed Oct 25 2000 - 05:54:18 PDT


I want to index PDF files that contain German umlauts (ä, ö, ü, ß). My tests show that htdig (v. 3.1.5) and xpdf (v. 0.91) handle umlauts well, but the external parser parse_doc.pl does not: it splits every word containing an umlaut into two words and drops the umlaut itself.
For example, the parser output looks like this:

w beim 41 0
w diesj 45 0
w hrigen 50 0
w den 58 0
w Platz 62 0

In this case the German word "diesjährigen" is split into "diesj" and "hrigen", and htsearch finds both fragments instead of the whole word.
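
The symptom can be reproduced outside of htdig. The following is only a minimal sketch in Python, not the code of parse_doc.pl: it assumes that the parser tokenizes text with an ASCII-only character class, which is a guess based on the output above. The names ASCII_WORD, LATIN1_WORD and words_with_offsets are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical ASCII-only word pattern. This is an assumption about the
# cause of the split, not a pattern taken from parse_doc.pl.
ASCII_WORD = re.compile(r"[A-Za-z0-9]+")

# The same pattern extended with the ISO-8859-1 letters
# (U+00C0 to U+00FF, leaving out the multiplication and division signs).
LATIN1_WORD = re.compile(r"[A-Za-z0-9\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+")


def words_with_offsets(text, pattern):
    """Return (word, start_offset) pairs for every match of pattern in text."""
    return [(m.group(0), m.start()) for m in pattern.finditer(text)]


sample = "beim diesjährigen den Platz"

# With the ASCII-only pattern the umlaut acts as a word separator.
for word, offset in words_with_offsets(sample, ASCII_WORD):
    print("w", word, offset)

# With the extended pattern the word stays in one piece.
for word, offset in words_with_offsets(sample, LATIN1_WORD):
    print("w", word, offset)
```

If the assumption holds, the equivalent change in the parser would be to treat the Latin-1 letters as word characters. The offsets printed here are character positions in the sample string and are not meant to match the ones in the parser output above.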

Does anyone know how to solve this problem, for example with a modified version of parse_doc.pl?


Christian Huhn


This archive was generated by hypermail 2b28 : Wed Oct 25 2000 - 05:59:14 PDT