Re: [htdig] Problems with parse_doc.pl and German Umlaute


Subject: Re: [htdig] Problems with parse_doc.pl and German Umlaute
From: David Adams (D.J.Adams@soton.ac.uk)
Date: Wed Oct 25 2000 - 06:36:43 PDT


>
> Hi,
>
> I want to index PDF files with German Umlaute (ä, ö, ü, ß). Some tests have shown me that htdig (v. 3.1.5) and xpdf (v. 0.91) work well with German Umlaute, but the external parser parse_doc.pl has problems with them. It splits each word containing an Umlaut into two words and drops the Umlaut.
> For example:
>
> w beim 41 0
> w diesj 45 0
> w hrigen 50 0
> w den 58 0
> w Platz 62 0
>
> In this case the German word "diesjährigen" is split into "diesj" and "hrigen", and I can find both with htsearch.
>
> Does anyone know how to solve this problem, for example with a modified version of parse_doc.pl?
>
> Thanks,
>
> Christian Huhn
>

You could try the doc2html parser. I think that the latest version,
available from the Ht://Dig web site, will not split words this way, but
I have not tested it thoroughly.

If doc2html does not parse your .PDF files properly, then email an
example to me personally, and I'll make sure that the next version of
doc2html works correctly.
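
For illustration only: the output quoted above is what one would expect
from a word splitter that treats only ASCII letters and digits as word
characters. Whether that is what actually happens inside parse_doc.pl is
an assumption, not something confirmed in this thread. The sketch below
is in Python rather than Perl, and the function names and the Latin-1
letter ranges are hypothetical, not taken from parse_doc.pl or doc2html.

```python
import re

# Hypothetical tokenizers for illustration; not code from parse_doc.pl
# or doc2html.

# Only ASCII letters and digits count as word characters.
ASCII_WORD = re.compile(r"[A-Za-z0-9]+")

# ASCII plus the ISO-8859-1 (Latin-1) letters, which include the
# German ä, ö, ü and ß. The gaps skip the multiplication sign (U+00D7)
# and the division sign (U+00F7).
LATIN1_WORD = re.compile(
    r"[A-Za-z0-9\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+"
)


def split_ascii(text):
    """Split text into words, breaking at every non-ASCII character."""
    return ASCII_WORD.findall(text)


def split_latin1(text):
    """Split text into words, keeping Latin-1 letters inside words."""
    return LATIN1_WORD.findall(text)


sample = "beim diesjährigen den Platz"
print(split_ascii(sample))   # the umlaut acts as a word separator
print(split_latin1(sample))  # the umlaut stays inside the word
```

With the ASCII-only pattern, "diesjährigen" comes out as "diesj" and
"hrigen", matching the symptom reported above. With the wider pattern
the word stays whole.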

-- 
 
David J Adams
<D.J.Adams@soton.ac.uk>
Computing Services
University of Southampton



