Re: [htdig] PDF and PostScript Parsing


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Tue, 23 Feb 1999 11:55:55 -0600 (CST)


According to Patrick Dugal:
> The documentation about pdf_parser says:
>
> description:
> Set this to the path of the program used to parse PDF files, including
> all command-line options. The program will be called with the parameters:
> infile outfile,
> where infile is a file to parse and outfile is the **** PostScript output
> of the parser ****. The program is supposed to convert to a variant of
> PostScript, which is then parsed internally. Currently, Adobe's acroread
> program and the pdftops program that is part of the xpdf 0.80 package
> have been tested as pdf_parsers.
>
> The pdftotext software I want to use produces plain text output, not PostScript.
> But the documentation says that the outfile is supposed to be in PostScript. Do
> you think this will be a problem? Does the documentation need to be updated?
>
> Pat :)

I think you may be confusing pdf_parser with external_parsers. Yes,
the output of the pdf_parser must be PostScript, and it must conform to
the style that acroread outputs. You did mention in the first part of
your earlier message that you were asking about the internal parsing of
the acroread output, so I'm assuming you do have acroread, and it was
with acroread that you got all the concatenated strings. Is that right?
If so, please try acroread again, with htdig 3.1.1, to see if that solves
your problem.
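
To illustrate the kind of fix involved (the cause is described in the
quoted message below: word gaps produced by raising the character spacing
instead of emitting a space), here is a minimal Python sketch. It is NOT
the actual PDF.cc code. The fragment layout, the function name and the
threshold value are all assumptions made up for the illustration.

```python
# Simplified, hypothetical model: each fragment is (text, char_spacing),
# where char_spacing is the extra spacing in effect at the end of the
# fragment. A spacing above the threshold is taken to mean a word gap.

def join_fragments(fragments, threshold=1.0):
    """Join text fragments, inserting a space after any fragment whose
    character spacing exceeds the (assumed) threshold."""
    out = []
    for text, char_spacing in fragments:
        out.append(text)
        if char_spacing > threshold:
            out.append(" ")
    return "".join(out).strip()


# Without the heuristic (threshold never exceeded) the words run together.
concatenated = join_fragments(
    [("Protein", 2.8), ("Data", 2.8), ("Bank", 0.0)], threshold=float("inf"))

# With the heuristic, the large spacings become word breaks.
separated = join_fragments([("Protein", 2.8), ("Data", 2.8), ("Bank", 0.0)])
```

The choice of threshold is exactly the part that needs testing against
real files: too low and ordinary letter spacing splits words, too high
and the words stay concatenated.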

The pdftops utility SHOULD produce suitable PostScript for use as a
pdf_parser (though that has been called into question by Joe Jah), but
pdftotext definitely won't do! If you can get pdftops to work for you,
great, but please try acroread with the latest htdig first.

The external_parsers are another story. In the second part of your
earlier message, you asked how you could disable the internal PDF parser,
and use pdftotext instead, without changing the source code. I suggested
using an external parser to override the internal PDF handling. External
parsers require a very specific output format, but the parse_doc.pl script
is capable of generating that output from a number of different document
to text converters. You'd need to customise it to handle PDF files, and
pass them to pdftotext for initial parsing. The perl script would then
parse the text output from this filter, and spit out the records that
htdig expects from an external parser. This is a third alternative,
which is quite different from the other two. You can read up more on
external parsers in the documentation for the external_parsers attribute:

        http://www.htdig.org/attrs.html#external_parsers
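
As a rough sketch of the steps just described (recognise a PDF, hand it
to pdftotext, turn the text into records), something like the following
could serve as a starting point. It is written in Python rather than as a
patch to parse_doc.pl, and several things in it are assumptions: the
function names are invented, pdftotext is assumed to write to stdout when
given "-" as the output file, and the tab-separated "w" record layout
(word, location 0-1000, heading level) is my reading of the
external_parsers documentation and should be checked against the page
linked above.

```python
import subprocess

PDF_MAGIC = b"%PDF-"  # magic string suggested in this thread


def looks_like_pdf(header):
    """Return True if the given leading bytes start with the PDF magic."""
    return header.startswith(PDF_MAGIC)


def is_pdf(path):
    """Check the first bytes of a file for the PDF magic string."""
    with open(path, "rb") as f:
        return looks_like_pdf(f.read(len(PDF_MAGIC)))


def pdf_to_text(path, converter="pdftotext"):
    """Run the converter and return its text output.

    Assumption: the converter accepts "infile -" and writes plain text
    to stdout; adjust the arguments for other tools.
    """
    result = subprocess.run([converter, path, "-"],
                            capture_output=True, check=True)
    return result.stdout.decode("latin-1")


def word_records(text):
    """Yield one assumed 'w' record per word: word, location, heading."""
    words = text.split()
    total = max(len(words), 1)
    for i, word in enumerate(words):
        location = i * 1000 // total  # relative position, scaled to 0-1000
        yield "w\t%s\t%d\t0" % (word.lower(), location)
```

A wrapper script would call is_pdf() on the input file it is given, print
the lines from word_records(pdf_to_text(path)) to stdout, and be
registered for PDF documents through the external_parsers attribute.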

Does the documentation need to be updated? Not at the moment. If I
find that pdftops doesn't work as a pdf_parser after all, and we can't
fix either it or PDF.cc to handle it, then this will need to be mentioned
in the documentation. Until then, the documentation is accurate, as far
as I can tell. The pdftotext utility definitely won't work as a pdf_parser,
but it may be used as part of an external parser, if you customise
contrib/parse_doc.pl (in 3.1.1) to use it.

I hope that clarifies things.

> Gilles Detillieux wrote:
>
> > According to Patrick Dugal:
> > > I've discovered that ht://Dig 3.1.0b1's internal parsing
> > > misbehaves with many PDF documents, although it behaves well
> > > with some. My concern is with the internal parsing of the
> > > acroread output. It's my understanding that the way the
> > > output from acroread is parsed hasn't changed in the new
> > > version of PDF.cc, so this probably also applies to the
> > > newest version of ht://Dig.
> > >
> > > The problem occurs when searching for a word that is
> > > definitely contained in an indexed PDF file: the search
> > > results come back with the following snippet, for
> > > example:
> > >
> > > [o98-900.pdf]
> > > ... -
> > > nationsmustbepreparedtosubmitallstructuraldatarequired
> > > tovalidatethediscussiontotheProteinDataBank(Biology
> > > Department,Bldg.463,P.O.Box5000,BrookhavenNational
> > > Laboratory,Upton,NY11973-5000,U.S.A.).Allrelevantnu-
> > > cleicacidsequenceinformationmustbedepositedintheGen-
> > > Bankdatabase(GenBankSubmissions,NationalCenterfor ...
> > > http://mydomain.ca/o98-900.pdf 02/18/99, 46490 bytes
> > >
> > > As you can tell from this snippet, the internal parsing of
> > > the acroread output is not quite what you'd expect. The
> > > strings get concatenated somehow and so the data becomes
> > > nearly useless and impossible to search.
> >
> > Looks like the same problem I had with some PDF files. I fixed it in
> > 3.1.1, so do give that a try. It may very well fix the problem with your
> > files too. My files were generated by Adobe Acrobat PDF Writer, from
> > Corel DRAW files, but the same effect may occur with other file types too.
> > The problem is that sometimes the inter-word spacing is generated by
> > cranking up the character spacing, rather than actually using a space
> > character, or a motion command. The latest version of PDF.cc does try
> > to deal with this, and I'd appreciate further testing by others to make
> > sure my assumptions about the spacing threshold are correct.
> >
> > > Is there a way I can configure htdig to disable the internal
> > > parsing of the acroread output? I'd like to use the
> > > pdftotext program included in the xpdf software to do the
> > > whole conversion from PDF to text and have htdig receive
> > > this file internally in the indexing process. How would I
> > > go about doing that without changing the source code?
> > >
> > > Any suggestions would be very helpful.
> >
> > Yup, you can define an external parser, and it should override the
> > internal one. You could use the parse_doc.pl perl script (included
> > in 3.1.1's contrib directory) as a starting point. Add to it a bit
> > of code to recognise the PDF file magic string ("%PDF-" should do it),
> > and call pdftotext to parse the PDF file into text.
> >
> > I'm also going to look into using the pdftops program, included with xpdf,
> > as a PDF parser for use with the internal PDF.cc code. Earlier reports
> > on this list claimed it worked, and that was the reason for moving
> > the acroread-specific options into the pdf_parser attribute. However,
> > Joe Jah just reported yesterday that it doesn't seem to work at all,
> > and he claims to be using the latest version of xpdf. I'll let you know
> > if I get that working.
> >
> > In any case, Patrick, as you seem to have run into the same problem I had,
> > I'd appreciate knowing if my fix to PDF.cc in 3.1.1 solves the problem
> > for you too, when using acroread. So please try that before switching
> > to a different parser. Thanks.
> >
> > --
> > Gilles R. Detillieux E-mail: <grdetil@scrc.umanitoba.ca>
> > Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
> > Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> > Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
> > ------------------------------------
> > To unsubscribe from the htdig mailing list, send a message to
> > htdig@htdig.org containing the single word "unsubscribe" in
> > the SUBJECT of the message.
>

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



This archive was generated by hypermail 2.0b3 on Fri Feb 26 1999 - 14:34:12 PST