Re: [htdig] Newbie question on excerpts from PDFs


From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Thu Sep 07 2000 - 11:16:40 PDT


According to Sue Moffitt:
> Our site has many PDFs, and on searching with htdig many hits come back with
> rubbish as an excerpt. What these particular PDFs seem to have in common is
> custom embedded fonts. Is there any way of getting around this problem and
> getting readable excerpts?

That depends on how you're indexing your PDFs. If you're using acroread,
I'd recommend trying doc2html with pdftotext instead. If you're already
using an external parser or converter, maybe give acroread 3.0 a try instead.

See http://www.htdig.org/FAQ.html#q4.9
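
Before changing the indexing setup, it may help to look at what a converter actually extracts from one of the problem PDFs. The following is only a rough sketch in Python, assuming pdftotext (from xpdf) is installed and on the PATH; the function names are made up for illustration and are not part of htdig or doc2html.

```python
import subprocess
import sys


def pdftotext_command(pdf_path):
    """Build the pdftotext command line; "-" sends the text to stdout."""
    return ["pdftotext", pdf_path, "-"]


def extract_text(pdf_path):
    """Run pdftotext on one file and return whatever text it produced."""
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    # Assumption: output is treated as UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only runs when given exactly one argument that names a PDF file.
    if len(sys.argv) == 2 and sys.argv[1].lower().endswith(".pdf"):
        # The start of the text is roughly what an excerpt would draw on.
        print(extract_text(sys.argv[1])[:300])
```

If the first few hundred characters printed for a given file are already rubbish, switching the way htdig calls the converter is unlikely to help for that document.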

Embedded fonts can be a problem, because there's no guarantee they'll use
standard encodings or even standard glyph names, so there may not be any
way of getting intelligible text out of these documents other than with
your own eyeballs.
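
To find out which PDFs are affected without reading every excerpt, one could run a crude check over the extracted text. This is a hypothetical heuristic, not anything htdig does: it assumes English text and simply measures how many whitespace-separated tokens look like ordinary words. The names and the 0.6 threshold are arbitrary choices for illustration.

```python
VOWELS = set("aeiouyAEIOUY")
PUNCTUATION = ".,;:!?()[]\"'"


def looks_like_word(token):
    """True if the token is plain ASCII letters with at least one vowel."""
    core = token.strip(PUNCTUATION)
    return core.isalpha() and core.isascii() and any(c in VOWELS for c in core)


def readable_fraction(text):
    """Fraction of whitespace-separated tokens that look like words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if looks_like_word(t)) / len(tokens)


def is_probably_readable(text, threshold=0.6):
    """Guess whether extracted text is usable as an excerpt (threshold is arbitrary)."""
    return readable_fraction(text) >= threshold
```

Documents that fall below the threshold are candidates for the embedded font problem and would need a closer look by hand.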

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



