Re: [htdig] Search in pdf documents


Subject: Re: [htdig] Search in pdf documents
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Thu May 04 2000 - 07:58:51 PDT


According to Andoni Ayala:
> Searching for accented words works fine in HTML files, but not in
> .pdf files.
>
> Example:
>
> in the .pdf file: petición
>
> but when I run "rundig", it saves the word "petici" in db.wordlist
>
> Where is my mistake?

HTML files generally use ISO-8859-1 (Latin 1) encoding for accented
characters, or SGML entities, which htdig maps to Latin 1.
PDF documents may use any of a number of different encodings. When
you use acroread to parse them, it makes no attempt to remap these
encodings, so accents won't show up unless the document happens to
encode everything in Latin 1. When you use pdftotext (from conv_doc.pl
or one of the other external converters or parsers), it attempts
to remap the various encodings to ISO-8859-1, so it's likely to work
better, but I don't know whether it always does this correctly.
I think that as long as the embedded fonts use standard glyph names, it
should work, but I'm not completely certain about how xpdf/pdftotext
works internally.
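
To illustrate how a word can get cut off at the accent, here is a
minimal Python sketch. It is not htdig's actual word parser: the
function name, the regular expression and the minimum word length of 3
are assumptions made for the example, and Mac Roman is only one
possible non-Latin 1 encoding a PDF might use. The text is read as if
it were Latin 1, and only ASCII letters and Latin 1 accented letters
count as word characters.

```python
import re

# Word characters as a Latin 1 only tokenizer might see them (assumption):
# ASCII letters plus the accented letters in the range 0xC0-0xFF,
# leaving out the multiplication sign (0xD7) and division sign (0xF7).
LATIN1_WORD = re.compile(r"[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+")


def words_as_latin1(raw, min_length=3):
    """Decode raw bytes as Latin 1 and return the words found in them.

    Words shorter than min_length are dropped (hypothetical threshold).
    """
    text = raw.decode("latin-1")
    return [w for w in LATIN1_WORD.findall(text) if len(w) >= min_length]


word = "petición"

# Encoded in Latin 1, the accented letter is byte 0xF3, a valid letter.
print(words_as_latin1(word.encode("latin-1")))    # ['petición']

# Encoded in Mac Roman, the accented letter is byte 0x97, which Latin 1
# treats as a control character, so the word is split at that point.
print(words_as_latin1(word.encode("mac_roman")))  # ['petici']
```

In the second case the trailing "n" is a separate one-letter word and
is dropped for being too short, which would leave just "petici" in the
word list.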

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



