Re: [htdig] PDF and PostScript Parsing


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Tue, 23 Feb 1999 10:43:35 -0600 (CST)


According to Patrick Dugal:
> I've discovered that ht://Dig 3.1.0b1's internal parsing
> misbehaves with many PDF documents, although it behaves well
> with some. My concern is with the internal parsing of the
> acroread output. It's my understanding that the way the
> output from acroread is parsed hasn't changed in the new
> version of PDF.cc, so this probably also applies to the
> newest version of ht://Dig.
>
> The problem shows up when searching for a word that is
> definitely contained in an indexed PDF file: the search
> results come back with the following snippet, for
> example:
>
> [o98-900.pdf]
> ... -
> nationsmustbepreparedtosubmitallstructuraldatarequired
> tovalidatethediscussiontotheProteinDataBank(Biology
> Department,Bldg.463,P.O.Box5000,BrookhavenNational
> Laboratory,Upton,NY11973-5000,U.S.A.).Allrelevantnu-
> cleicacidsequenceinformationmustbedepositedintheGen-
> Bankdatabase(GenBankSubmissions,NationalCenterfor ...
> http://mydomain.ca/o98-900.pdf 02/18/99, 46490 bytes
>
> As you can tell from this snippet, the internal parsing of
> the acroread output is not quite what you'd expect. The
> strings get concatenated somehow and so the data becomes
> nearly useless and impossible to search.

Looks like the same problem I had with some PDF files. I fixed it in
3.1.1, so do give that a try. It may very well fix the problem with your
files too. My files were generated by Adobe Acrobat PDF Writer, from
Corel DRAW files, but the same effect may occur with PDFs produced from
other sources too.
The problem is that sometimes the inter-word spacing is generated by
cranking up the character spacing, rather than actually using a space
character, or a motion command. The latest version of PDF.cc does try
to deal with this, and I'd appreciate further testing by others to make
sure my assumptions about the spacing threshold are correct.
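
To illustrate the idea of a spacing threshold, here is a minimal Python
sketch. It is not the code in PDF.cc. The input format (a list of text
runs, each paired with the extra spacing applied after it), the function
name join_runs and the WORD_GAP_FRACTION value are all assumptions made
up for this example.

```python
# Hypothetical threshold: extra spacing wider than this fraction of the
# font size is treated as a word break. The value is an illustrative
# guess, not the one PDF.cc uses.
WORD_GAP_FRACTION = 0.2


def join_runs(runs, font_size, gap_fraction=WORD_GAP_FRACTION):
    """Join text runs, inserting a space after any run whose extra
    spacing exceeds the threshold.

    runs is a list of (text, extra_spacing) pairs, with extra_spacing
    in the same units as font_size.
    """
    threshold = font_size * gap_fraction
    pieces = []
    for text, extra_spacing in runs:
        pieces.append(text)
        if extra_spacing > threshold:
            # Wide spacing stands in for a missing space character.
            pieces.append(" ")
    return "".join(pieces).strip()


# Runs as they might come out of a file with no real space characters.
sample = [("must", 3.0), ("be", 3.0), ("pre", 0.1), ("pared", 3.0)]
print(join_runs(sample, font_size=10))
```

If the threshold is set too high, words stay glued together as in the
snippet quoted above. If it is set too low, letter-spaced words get
split apart, which is why testing against more files matters.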

> Is there a way I can configure htdig to disable the internal
> parsing of the acroread output? I'd like to use the
> pdftotext program included in the xpdf software to do the
> whole conversion from PDF to text and have htdig receive
> this file internally in the indexing process. How would I
> go about doing that without changing the source code?
>
> Any suggestions would be very helpful.

Yup, you can define an external parser, and it should override the
internal one. You could use the parse_doc.pl perl script (included
in 3.1.1's contrib directory) as a starting point. Add to it a bit
of code to recognise the PDF file magic string ("%PDF-" should do it),
and call pdftotext to parse the PDF file into text.
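
As a rough sketch of those two steps (checking for the magic string and
handing the file to pdftotext), something like the following Python
could serve as a starting point. It is not parse_doc.pl and it does not
produce the record format ht://Dig expects from an external parser. The
function names are hypothetical, and the use of "-" to make pdftotext
write to standard output is an assumption about the installed xpdf
version.

```python
import subprocess

# Magic string found at the start of a PDF file.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path):
    """Return True if the file begins with the PDF magic string."""
    with open(path, "rb") as handle:
        return handle.read(len(PDF_MAGIC)) == PDF_MAGIC


def pdftotext_command(path):
    """Build the pdftotext command line.

    Assumption: "-" as the output name sends the text to stdout.
    """
    return ["pdftotext", path, "-"]


def pdf_to_text(path):
    """Run pdftotext on path and return the extracted text."""
    if not looks_like_pdf(path):
        raise ValueError("not a PDF file: %s" % path)
    result = subprocess.run(
        pdftotext_command(path), capture_output=True, check=True
    )
    # Decode leniently; the real encoding depends on the document.
    return result.stdout.decode("latin-1", errors="replace")
```

The text returned by pdf_to_text would still have to be split into
words and written out in whatever form the external parser interface
requires.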

I'm also going to look into using the pdftops program, included with xpdf,
as a PDF parser for use with the internal PDF.cc code. Earlier reports
on this list claimed it worked, and that was the reason for moving
the acroread-specific options into the pdf_parser attribute. However,
Joe Jah just reported yesterday that it doesn't seem to work at all,
and he claims to be using the latest version of xpdf. I'll let you know
if I get that working.

In any case, Patrick, as you seem to have run into the same problem I had,
I'd appreciate knowing if my fix to PDF.cc in 3.1.1 solves the problem
for you too, when using acroread. So please try that before switching
to a different parser. Thanks.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930