Antti Rauramo (antti.rauramo@edita.fi)
Mon, 08 Feb 1999 15:23:53 +0200
I wrote:
> Hi all!
>
> I'm having a bit of trouble indexing pdf-documents with htdig. Seems
> that acroread -toPostScript converts the iso-latin characters (äöåÄÖÅ
> etc) in a way htdig doesn't recognize. I can make a perl wrapper that'd
> convert the characters acroread makes to the normal format, but htdig
> doesn't seem to find them anyway. With html documents
> everything works out fine.
OK, quoting myself... To rephrase the question and maybe get an answer:
In what form does htdig expect the iso-latin characters in PostScript files
so that it recognizes them?
It seems that the characters are from the Mac charset, since the PDFs were
made on Macs. I have a wrapper that runs acroread and converts the
characters to the PC format, and now they show up OK in the $(EXCERPT), but
searching for them produces no hits in the PDF files. Again, all is OK on
the HTML pages.
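For illustration, the conversion such a wrapper performs could be sketched in Python roughly as below. This is only a sketch under the assumption that the text coming out of acroread really is in the Mac (MacRoman) charset and that the target is ISO-8859-1; the function name mac_to_latin1 is hypothetical and is not part of htdig or acroread.

```python
def mac_to_latin1(data: bytes) -> bytes:
    """Re-encode bytes from MacRoman to ISO-8859-1.

    Assumption: the input is MacRoman, as suspected for PDFs made on Macs.
    """
    text = data.decode("mac_roman")
    # Characters with no ISO-8859-1 equivalent become "?" instead of raising.
    return text.encode("latin-1", errors="replace")


if __name__ == "__main__":
    # 0x8A is "ä" in MacRoman; it should come out as 0xE4, the Latin-1 "ä".
    print(mac_to_latin1(b"p\x8a\x8a"))
```

Doing the conversion on whole byte strings rather than character by character keeps the mapping table out of the wrapper, at the cost of trusting the codec's table.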
What acroread (and pdftops, for that matter) produces is \202, which maps to
a weird character in the excerpt. I've tried \344, ä, ä, but none
seems to work. The first two are correct in the excerpt.
Htmerge -v reveals that htdig has not correctly mapped the characters.
Now I'm trying to add 0212 -> 'ä' etc. to PDF::addToString in PDF.cc, still
with no luck on searching: the excerpt is OK, but there are no hits.
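To make the octal values above concrete, here is a small Python sketch of the kind of mapping being attempted: it expands PostScript-style \ddd octal escapes into bytes, then re-encodes from MacRoman to ISO-8859-1. It is not the code in PDF.cc; the names unescape_octal and ps_to_latin1 are hypothetical, and the MacRoman assumption may not hold for every PDF. Under that assumption octal 212 is 'ä' and octal 202 is 'Ç' in MacRoman, while octal 344 is 'ä' in ISO-8859-1.

```python
import re

# Matches a backslash followed by one to three octal digits, e.g. \212.
_OCTAL_ESCAPE = re.compile(rb"\\([0-7]{1,3})")


def unescape_octal(ps: bytes) -> bytes:
    """Expand \\ddd octal escapes into single bytes (other escapes are left alone)."""
    return _OCTAL_ESCAPE.sub(
        lambda m: bytes([int(m.group(1), 8) & 0xFF]), ps
    )


def ps_to_latin1(ps: bytes) -> bytes:
    """Expand octal escapes, then re-encode MacRoman as ISO-8859-1.

    Assumption: the escaped bytes are MacRoman character codes.
    """
    text = unescape_octal(ps).decode("mac_roman")
    # Characters without a Latin-1 equivalent are replaced by "?".
    return text.encode("latin-1", errors="replace")


if __name__ == "__main__":
    print(ps_to_latin1(rb"p\212\212"))  # MacRoman 0x8A, expected Latin-1 0xE4
    print(ps_to_latin1(rb"\202"))       # MacRoman 0x82
```

A check like this only shows what bytes reach the indexer; whether htdig then treats those bytes as word characters depends on its own configuration and locale handling.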
Running ht://Dig 3.1.0b4 on Solaris 2.6, with acroread v. 3.01 for Unix.
LC_CTYPE, or locale, is iso_8859_1.
--
Antti Rauramo, WWW and database specialist, Edita Verkkoviestintä
antti.rauramo@edita.fi, +358-9-8501 4004 (mobile)
This archive was generated by hypermail 2.0b3 on Wed Feb 10 1999 - 17:09:05 PST