Re: [htdig] PDF & ISO-Latin chars


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Thu, 12 Aug 1999 14:50:59 -0500 (CDT)


According to Antti Rauramo:
> Anyone out there indexing pdf-files with ISO-Latin characters in them
> ( mainly)? Seems that htdig doesn't understand the meaning of the
> special characters, and shows them w/o conversion; thus '' in a
> document made with a mac shows as '' on all browsers etc. Viewing ps's
> converted from pdf's with acroread the same way htdig does displays the
> characters correctly.
>
> Try http://www.vero.fi/cgi-bin/verohaku?config=vero_kehitys, search for
> "huoneiston", see the excerpt, compare it to the same spot in the
> document. 3011.pdf is made with a PC, 780.pdf with a Mac.

The problem is that the Mac is in a world of its own when it comes to
character sets. Your Mac-made PDF uses a different encoding than the
ISO-8859-1 (Latin 1) encoding that htdig expects to find in the fi_FI
locale. Unfortunately, htdig's simple PDF PostScript parser doesn't
look at the encoding, just the text strings, so if they're not in the
Latin 1 encoding, the non-ASCII characters won't make sense.
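
To make the mismatch concrete, here is a minimal Python sketch. It assumes the Mac-made PDF stores its strings in Mac Roman, which is a guess on my part (the encoding is not named anywhere above), and the sample word is purely illustrative. It shows what a Latin 1 reader sees when handed Mac-encoded bytes, and what a conversion step would have to do.

```python
# Sketch: what happens when Mac-encoded bytes are read as Latin 1.
# Assumption: the Mac-made PDF's strings are in Mac Roman; the encoding
# is not named in the discussion above.


def misread_as_latin1(text, actual="mac_roman"):
    """Encode text the way the producer did, then decode it the way
    a reader expecting Latin 1 would."""
    return text.encode(actual).decode("latin-1")


def to_latin1(raw, actual="mac_roman"):
    """Re-encode bytes from the producer's encoding into Latin 1.
    Characters with no Latin 1 equivalent are replaced with '?'."""
    return raw.decode(actual).encode("latin-1", errors="replace")


if __name__ == "__main__":
    word = "säästö"  # illustrative Finnish word, not taken from the PDFs
    # The accented letters come out as unrelated code points.
    print(repr(misread_as_latin1(word)))
    # After conversion the bytes decode correctly as Latin 1.
    print(to_latin1(word.encode("mac_roman")).decode("latin-1"))
```

Plain ASCII words such as "huoneiston" are identical in both encodings, which is why only the accented characters look wrong in the excerpts.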

On the other hand, pdftotext (part of the xpdf package) seems to handle
encoding conversion just fine. I was able to extract what looked to my
admittedly untrained eye to be Finnish text, with accents. You might
have better luck with the parse_doc.pl script as an external parser
for your PDFs. You should use version 0.90 of xpdf, rather than 0.80,
as it won't have to be patched to work properly. Also, as your 780.pdf
is in a two column format, you'll need to edit parse_doc.pl to use the
-raw option, to separate the columns when indexing. (The commented line
in parse_doc.pl says -rawdump, but that was for patched 0.80 source;
0.90 uses -raw.)
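
As a rough illustration of the call the external parser ends up making, here is a small Python sketch. This is not the code in parse_doc.pl (which is Perl and is not reproduced here); the function names are made up for the example, and passing "-" as the output file to get text on stdout is an assumption about the installed pdftotext build.

```python
import subprocess


def pdftotext_command(pdf_path, raw=True):
    """Build the pdftotext argument list (hypothetical helper).
    -raw is the option named above for xpdf 0.90."""
    cmd = ["pdftotext"]
    if raw:
        cmd.append("-raw")
    # Assumption: "-" as the output file sends the text to stdout.
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, raw=True):
    """Run pdftotext and return its output as bytes.
    Requires pdftotext (xpdf) to be installed and on the PATH."""
    result = subprocess.run(
        pdftotext_command(pdf_path, raw),
        check=True,
        stdout=subprocess.PIPE,
    )
    return result.stdout
```

Calling extract_text("780.pdf") would then return the extracted text, assuming the file is in the current directory.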

See http://www.htdig.org/FAQ.html#q4.9 for more information.

You may want to adapt the script to extract titles from PDFs using
pdfinfo, if the titles matter to you. (That's something on my to-do
list I can't seem to find the time for.)
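
For anyone who wants to try it, the title extraction might look something like the Python sketch below. It assumes pdfinfo prints one "Key: value" pair per line with a "Title:" entry; the function names and the sample output in the usage note are invented for illustration, and none of this is part of parse_doc.pl.

```python
import subprocess


def title_from_pdfinfo(output):
    """Return the Title value from pdfinfo-style output, or None.
    Assumes one "Key: value" pair per line."""
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Title":
            return value.strip() or None
    return None


def pdf_title(pdf_path):
    """Run pdfinfo on a file and pull out its title (hypothetical helper).
    Requires pdfinfo (xpdf) to be installed and on the PATH."""
    result = subprocess.run(
        ["pdfinfo", pdf_path],
        check=True,
        stdout=subprocess.PIPE,
    )
    # Decoding as Latin 1 is an assumption about the output encoding.
    return title_from_pdfinfo(result.stdout.decode("latin-1"))
```

Given output such as "Title:          Example document" followed by other fields, title_from_pdfinfo returns "Example document"; a PDF with no title set yields None, in which case the script could fall back to whatever it does now.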

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



