Re: [htdig] acroread and iso-latin-1?


Antti Rauramo (antti.rauramo@edita.fi)
Mon, 08 Feb 1999 15:23:53 +0200


I wrote:

> Hi all!
>
> I'm having a bit of trouble indexing pdf-documents with htdig. Seems
> that acroread -toPostScript converts the iso-latin characters (äöåÄÖÅ
> etc) in a way htdig doesn't recognize. I can make a perl wrapper that'd
> convert the characters acroread makes to the normal format, but htdig
> doesn't seem to find them anyway. With html documents
> everything works out fine.

Okay, quoting myself... To rephrase the question and maybe get an answer:

In what form does htdig need the iso-latin characters in PostScript files
in order to recognize them?

It seems that the characters are from the mac charset, since the pdf's were
made with macs. I've a wrapper that runs acroread and converts the
characters to the PC format, and now they show up ok in the $(EXCERPT), but
searching for them produces no hits on the pdf-files. Again all ok on the
html-pages.
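
For reference, the conversion step of such a wrapper could look roughly like
the sketch below. It is not the actual wrapper. It assumes the PostScript
output carries the 8-bit characters as octal escapes in the standard Mac Roman
encoding, and the names convert_postscript and mac_escape_to_latin1 are made up
for illustration.

```python
import re

# Matches a PostScript octal escape such as \212 (one to three octal digits).
# Simplification: an escaped backslash followed by digits is not special-cased.
OCTAL_ESCAPE = re.compile(r"\\([0-7]{1,3})")

def mac_escape_to_latin1(match):
    """Re-encode one octal escape from Mac Roman to ISO-8859-1."""
    code = int(match.group(1), 8)
    if code < 0x80 or code > 0xFF:
        return match.group(0)      # ASCII range is the same in both charsets
    char = bytes([code]).decode("mac_roman")
    try:
        latin1 = char.encode("latin-1")
    except UnicodeEncodeError:
        return match.group(0)      # no ISO-8859-1 equivalent, leave untouched
    return "\\%03o" % latin1[0]

def convert_postscript(text):
    """Rewrite every Mac Roman octal escape in text as its ISO-8859-1 escape."""
    return OCTAL_ESCAPE.sub(mac_escape_to_latin1, text)

print(convert_postscript(r"(p\212iv\212)"))   # prints (p\344iv\344)
```

Characters that exist only in the mac charset are left as they are here, so
they would still need separate handling.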

What acroread (and pdftops for that matter) produces is \202, which maps to
a weird character in the excerpt. I've tried \344, ä, &auml;, but none
seems to work. The first two are correct in the excerpt.
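
To check what a given octal escape stands for in each charset, something along
these lines can be used. It is only a sketch: describe_octal is a hypothetical
helper name, and it assumes the standard Mac Roman and ISO-8859-1 tables.

```python
import unicodedata

def describe_octal(octal_digits, charset):
    """Return (character, Unicode name) for an octal byte value in a charset."""
    char = bytes([int(octal_digits, 8)]).decode(charset)
    # Control characters have no Unicode name, so fall back to a label.
    return char, unicodedata.name(char, "<control or unnamed>")

for digits in ("202", "212", "344"):
    for charset in ("mac_roman", "latin-1"):
        print(digits, charset, describe_octal(digits, charset))
```

Under these assumptions \212 read as Mac Roman and \344 read as ISO-8859-1 both
come out as ä, while \202 read as ISO-8859-1 is an unprintable control
character.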

Htmerge -v reveals that htdig has not correctly mapped the characters.

Now I'm trying to add 0212 -> 'ä' etc to PDF::addToString in PDF.cc, still
with no luck on searching: excerpt ok, but no hits.

Running ht://Dig 3.1.0b4 on Solaris 2.6, with acroread v. 3.01 for Unix.
LC_CTYPE, or locale, is iso_8859_1.

--
- Antti Rauramo, WWW and database specialist, Edita Verkkoviestintä
- antti.rauramo@edita.fi, +358-9-8501 4004 (mobile)



