Re: [htdig] PDF and PostScript Parsing


Patrick Dugal (patrick.dugal@nrc.ca)
Wed, 24 Feb 1999 12:29:12 -0500


I have upgraded to ht://Dig 3.1.1 and have been testing how it handles PDF
files. Indexing them still doesn't work for me. The organization I work for
wants to index many PDFs, but even the new version of ht://Dig still
misbehaves with some of them, although other PDF files are indexed properly.
htsearch outputs excerpts with the words run together, as in:

[o98-900.pdf]
     ... -
nationsmustbepreparedtosubmitallstructuraldatarequired
tovalidatethediscussiontotheProteinDataBank(Biology
     Department,Bldg.463,P.O.Box5000,BrookhavenNational
Laboratory,Upton,NY11973-5000,U.S.A.).Allrelevantnu-
     cleicacidsequenceinformationmustbedepositedintheGen-
Bankdatabase(GenBankSubmissions,NationalCenterfor ...
     http://mydomain.ca/o98-900.pdf 02/18/99, 46490 bytes

I have tried writing my own pdf2text Perl program which parses the output from
acroread. But I later found out that unless I was prepared to read a lot about
the PDF format, I wasn't going to be very successful. I think the people who
programmed xpdf's pdftotext really took the time to understand the many massive
documents about PDF files. I don't want to reinvent the wheel. The only
successful parser I have seen work on PDFs is pdftotext.

I looked at PDF.cc from 3.1.1 and the way it parses the output from acroread
doesn't appear to have changed since a few versions back. I tried setting
external_parsers to a tweaked version of htparsedoc which invokes xpdf's
pdftotext, but pdftotext doesn't seem to be receiving the name of the file to
parse as an argument. As a result, the parser outputs useless info, and the
data about the PDF doesn't enter the database at all. How can this be fixed?
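
For reference, here is a minimal sketch of the kind of wrapper being attempted,
written in Python rather than as the actual htparsedoc script. It assumes that
the indexer passes the path of the temporary file as the first command-line
argument, that "pdftotext FILE -" writes the text to standard output, and that
the indexer expects tab-separated "w" (word) and "h" (excerpt) records. All
three are assumptions from my reading of the external_parsers documentation
and should be checked against the installed versions. The name build_records
is made up for illustration.

```python
import re
import subprocess
import sys


def build_records(text):
    """Turn plain text into tab-separated records (assumed format)."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    total = len(words)
    records = []
    for i, word in enumerate(words):
        # Assumed: location is a relative position from 0 to 1000.
        location = (i * 1000) // total
        records.append("w\t%s\t%d\t0" % (word.lower(), location))
    # Collapse whitespace and keep a short excerpt for the result list.
    excerpt = " ".join(text.split())[:200]
    if excerpt:
        records.append("h\t" + excerpt)
    return records


def main(argv):
    # The file to parse must arrive as argv[1]; if it is missing,
    # pdftotext has nothing to work on and the output is useless.
    pdf_path = argv[1]
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" assumed to mean stdout
        stdout=subprocess.PIPE,
        check=True,
    )
    text = result.stdout.decode("latin-1", errors="replace")
    for record in build_records(text):
        print(record)
    return 0


# Only run the conversion when a file name was actually passed in.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

Logging sys.argv to a file at the top of such a wrapper is a quick way to see
whether the file name is being passed along at all.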

I will do some more testing and I will update you with my findings.

Pat :)

Gilles Detillieux wrote:

> I've followed up privately with Patrick, but for the benefit of others
> on the list, I'll give my most recent findings here.
>
> According to me:
> > According to Patrick Dugal:
> > > As you can tell from this snippet, the internal parsing of
> > > the acroread output is not quite what you'd expect. The
> > > strings get concatenated somehow and so the data becomes
> > > nearly useless and impossible to search.
> >
> > Looks like the same problem I had with some PDF files. I fixed it in
> > 3.1.1, so do give that a try. It may very well fix the problem with your
> > files too. My files were generated by Adobe Acrobat PDF Writer, from
> > Corel DRAW files, but the same effect may occur with other file types too.
> > The problem is that sometimes the inter-word spacing is generated by
> > cranking up the character spacing, rather than actually using a space
> > character, or a motion command. The latest version of PDF.cc does try
> > to deal with this, and I'd appreciate further testing by others to make
> > sure my assumptions about the spacing threshold are correct.
> >
> > > Is there a way I can configure htdig to disable the internal
> > > parsing of the acroread output? I'd like to use the
> > > pdftotext program included in the xpdf software to do the
> > > whole conversion from PDF to text and have htdig receive
> > > this file internally in the indexing process. How would I
> > > go about doing that without changing the source code?
> > >
> > > Any suggestions would be very helpful.
> >
> > Yup, you can define an external parser, and it should override the
> > internal one. You could use the parse_doc.pl perl script (included
> > in 3.1.1's contrib directory) as a starting point. Add to it a bit
> > of code to recognise the PDF file magic string ("%PDF-" should do it),
> > and call pdftotext to parse the PDF file into text.
>
> I don't know how well pdftotext will work as part of an external parser.
> I just tried pdftotext myself on one of the documents that had given me
> the concatenation problem in earlier versions of htdig. To solve this
> concatenation problem, you need something that can handle the silly
> character spacing in some PDF files. That means your best bet is to
> use acroread as your pdf_parser, with the latest version of htdig.
>
> > I'm also going to look into using the pdftops program, included with xpdf,
> > as a PDF parser for use with the internal PDF.cc code. Earlier reports
> > on this list claimed it worked, and that was the reason for moving
> > the acroread-specific options into the pdf_parser attribute. However,
> > Joe Jah just reported yesterday that it doesn't seem to work at all,
> > and he claims to be using the latest version of xpdf. I'll let you know
> > if I get that working.
>
> I can confirm that pdftops from xpdf 0.80 won't work as a pdf_parser
> with htdig. It still does NOT produce BT and ET tags, so PDF.cc just
> skims through the PostScript output from pdftops without indexing anything.
>
> Can those who claimed it did work please let us know what they modified
> to get it to work?
>
> --
> Gilles R. Detillieux E-mail: <grdetil@scrc.umanitoba.ca>
> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
> ------------------------------------
> To unsubscribe from the htdig mailing list, send a message to
> htdig@htdig.org containing the single word "unsubscribe" in
> the SUBJECT of the message.

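To illustrate the character-spacing problem described in the quoted message,
here is a small Python sketch of a gap-threshold heuristic. It is not the
PDF.cc code. The (character, x, width) tuples are a hypothetical
representation of positioned glyphs, and the default threshold of 0.3 times
the previous glyph's width is an arbitrary guess, not the value ht://Dig uses.

```python
def join_glyphs(glyphs, gap_ratio=0.3):
    """Rebuild a line of text from positioned glyphs.

    glyphs is a list of (char, x, width) tuples in reading order.
    A space is inserted when the gap between the end of one glyph and
    the start of the next exceeds gap_ratio times the previous width.
    """
    out = []
    prev_end = None
    prev_width = None
    for char, x, width in glyphs:
        if prev_end is not None and (x - prev_end) > gap_ratio * prev_width:
            out.append(" ")
        out.append(char)
        prev_end = x + width
        prev_width = width
    return "".join(out)


# "to" and "be" are separated only by extra spacing, with no space glyph.
sample = [("t", 0, 5), ("o", 5, 5), ("b", 15, 5), ("e", 20, 5)]
print(join_glyphs(sample))
```

With a threshold that is too high the same input comes out as "tobe", which is
the kind of concatenation shown in the htsearch excerpt above.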

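The magic-string check suggested in the quoted message can be sketched as
follows. This is Python rather than the Perl of parse_doc.pl, and the function
names are invented for illustration. It only inspects the first bytes of the
file: "%PDF-" for PDF, as quoted above, and "%!" for PostScript.

```python
PDF_MAGIC = b"%PDF-"
PS_MAGIC = b"%!"


def detect_type(data):
    """Classify a document by its leading bytes."""
    if data.startswith(PDF_MAGIC):
        return "pdf"
    if data.startswith(PS_MAGIC):
        return "postscript"
    return "unknown"


def detect_file(path):
    """Read just enough of the file to check the magic string."""
    with open(path, "rb") as handle:
        return detect_type(handle.read(len(PDF_MAGIC)))


print(detect_type(b"%PDF-1.2\n"))
print(detect_type(b"%!PS-Adobe-3.0\n"))
```

An external parser could use the result to decide whether to hand the file to
pdftotext or to some other converter.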


This archive was generated by hypermail 2.0b3 on Fri Feb 26 1999 - 14:34:12 PST