Re: [htdig] PDF Parsing Errors


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Tue, 10 Aug 1999 14:43:38 -0500 (CDT)


According to Jeff Kirk:
> I was wondering if anyone can help me with a small problem. I just
> installed the pdf parsing ability on our installation of htdig.
> Actually, it's a new server running RHL 6.0, 256MB mem, 9 GB hd, etc. I
> installed and compiled the current version of htdig and acroread 4.
> When documents are created with the distiller I get "Segmentation
> Violation Caught", but when I use the PDF Writer it works fine. At the
> shell prompt acroread converts the documents correctly, so it seems to
> be something with htdig. Anyone have any idea???

This seems to be a bug in Acrobat 4. You may want to report it to Adobe.
It works fine with Acrobat 3. The problem seems to be caused by the
-pairs option. If I manually run

        acroread -toPostScript -pairs profile_rob_98.pdf /tmp/t.ps

to convert one of my PDFs, Acrobat 4 gives the error

        profile_rob_98.pdf: Segmentation Violation Caught.

but with Acrobat 3, it works fine. However, if I try

        acroread -toPostScript profile_rob_98.pdf /tmp

it works fine with either version. You may want to revert to Acrobat 3.
Another fix is to write a wrapper script that calls acroread with the
last argument reduced to its directory (i.e. with the final filename
component stripped), and to remove the -pairs option, either in the
script or in your pdf_parser attribute in htdig.conf.
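
As a rough sketch of such a wrapper (not tested against acroread itself):
the function name acroread_nopairs and the ACROREAD override variable are
made up for illustration, and it assumes that, without -pairs, acroread
writes the PostScript into the given directory under the PDF's base name,
as the manual run above suggests.

```shell
#!/bin/sh
# Hypothetical wrapper around acroread: drops -pairs and passes the
# directory of the last argument instead of the output file name.

acroread_nopairs() {
    count=$#    # number of original arguments
    index=0
    for arg do
        shift                       # remove the argument being examined
        index=$((index + 1))
        if [ "$arg" = "-pairs" ]; then
            continue                # leave -pairs out of the rebuilt list
        fi
        if [ "$index" -eq "$count" ]; then
            # Last argument: /tmp/t.ps becomes /tmp
            arg=$(dirname -- "$arg")
        fi
        set -- "$@" "$arg"          # append the kept argument
    done
    # ACROREAD is an assumed override; it defaults to acroread on the PATH.
    "${ACROREAD:-acroread}" "$@"
}

# Run only when the script is given arguments.
if [ "$#" -gt 0 ]; then
    acroread_nopairs "$@"
fi
```

If this were saved under some name of your choosing and made executable,
the pdf_parser attribute in htdig.conf could name that script in place of
acroread. With -toPostScript -pairs in.pdf /tmp/out.ps as arguments, it
would end up running acroread -toPostScript in.pdf /tmp.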

> Number 2 is related to document titles. In Word 97, I tell the document
> a title and subject expecting it to parse into PDF title and subject.
> But when I dig and run a search on the web site to test it, the resulting
> title looks like this: þÿ  Obviously not what I want. Any
> suggestions?

PDF titles seem to be a pain in general. Distiller doesn't grab titles
from other applications to put into the PDF. (I don't think it can,
because from the application's point of view, it's just talking to a
printer driver.) All you can do is get into the habit of entering the
title into Distiller's dialog box whenever you create a PDF. On our site,
nobody bothered to do this, so when I adapted the parse_doc.pl script
to work as an external parser for PDFs, I never bothered to do anything
with titles, and instead just put out a title that includes the file name
(just as the script did with other document types).

With acroread, you don't have much choice - you get what it spits out
in the PostScript output. I don't know why you're getting garbage
characters, though. Perhaps they were entered accidentally?

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930




This archive was generated by hypermail 2.0b3 on Tue Aug 10 1999 - 12:44:43 PDT