Subject: [htdig] Antw: Re: [htdig] Problems with parse_doc.pl and German Umlaute
From: David Adams (D.J.Adams@soton.ac.uk)
Date: Fri Oct 27 2000 - 08:04:03 PDT
I'm glad that doc2html works OK for you.
In Perl these two forms are equivalent; both assign an empty (null) string:
$WP2HTML = "";
$WP2HTML = '';
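A minimal Perl sketch for comparing the two quoting styles above. It assumes nothing about doc2html beyond $WP2HTML being an ordinary scalar; the names $double, $single and $name are made up for illustration. The last two print statements show the one case where the quoting style does matter, namely when the string contains something to interpolate.

```perl
use strict;
use warnings;

# Hypothetical variable names, used only for this comparison.
my $double = "";    # double quotes: interpolating, but nothing to interpolate
my $single = '';    # single quotes: literal

# Both are the same zero-length string.
print "equal\n" if $double eq $single;
print "length: ", length($double), " and ", length($single), "\n";

# The styles differ only when there is something to interpolate.
my $name = 'wp2html';
print "interpolated: $name\n";    # prints the value of $name
print 'literal: $name', "\n";     # prints the text $name unchanged
```

So for an empty location either form should do; single quotes are the safer habit for paths that might contain $ or @.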
On Thu, 26 Oct 2000 9:23:37 +0200 firstname.lastname@example.org wrote:
> Thanks for your help!
> Your tool works perfectly, especially with German Umlaute. The description in the Details file was very helpful, so it was no problem for someone with no Perl experience to use doc2html.
> But I have one small remark about the Details file. In the installation description you write: If you don't have a particular utility then set its location as a null string. For example:
> $WP2HTML = '';
> I am not sure, but I think you mean $WP2HTML = ""; don't you?
> Christian Huhn
> >>> <D.J.Adams@soton.ac.uk> 25.10.2000 15.41 Uhr >>>
> > Hi,
> > I want to index PDF files containing German Umlaute (ä, ö, ü, ß). Some tests showed me that htdig (v. 3.1.5) and xpdf (v. 0.91) work pretty well with German Umlaute, but the external parser parse_doc.pl has problems with them. It splits a word containing an Umlaut into two words, dropping the Umlaut.
> > For example:
> > w beim 41 0
> > w diesj 45 0
> > w hrigen 50 0
> > w den 58 0
> > w Platz 62 0
> > In this case the German word "diesjährigen" is split into "diesj" and "hrigen", and I can find both with htsearch.
> > Does anyone know how to solve this problem, for example with a modified version of parse_doc.pl?
> > Thanks,
> > Christian Huhn
> You could try the doc2html parser. I think that the latest version,
> available from the Ht://Dig web site, will not split words this way, but
> I have not tested it thoroughly.
> If doc2html does not parse your .PDF files properly, then email an
> example to me personally, and I'll make sure that the next version of
> doc2html works correctly.
> -- David J Adams
> Computing Services
> University of Southampton
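On the word splitting reported in the original question: how parse_doc.pl actually tokenizes has not been checked here, so the following is only a sketch of one plausible cause, assuming a word pattern that accepts ASCII letters only. The sample sentence and both patterns are illustrative and are not taken from parse_doc.pl or doc2html. The script also assumes a modern Perl with UTF-8 source text, which is not the setup discussed in this thread.

```perl
use strict;
use warnings;
use utf8;    # the string literal below contains a literal umlaut

binmode(STDOUT, ':encoding(UTF-8)');

my $text = 'beim diesjährigen den Platz';

# Assumed ASCII-only word pattern: the umlaut acts as a word separator.
my @ascii_words = $text =~ /[A-Za-z]+/g;

# Unicode-aware pattern: \p{L} matches any letter, including ä, ö, ü, ß.
my @unicode_words = $text =~ /\p{L}+/g;

print join(' | ', @ascii_words), "\n";      # beim | diesj | hrigen | den | Platz
print join(' | ', @unicode_words), "\n";    # beim | diesjährigen | den | Platz
```

The first pattern reproduces the "diesj" / "hrigen" split from the example; the second keeps the word whole.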
This archive was generated by hypermail 2b28 : Fri Oct 27 2000 - 08:09:57 PDT