Re: [htdig] PDF parsing in htdig/PDF.cc


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 24 Feb 1999 15:12:02 -0600 (CST)


According to Patrick Dugal:
> I have tried writing my own pdf2text Perl program which parses the output from
> acroread. But I later found out that unless I was prepared to read a lot about
> PDF files, I wasn't going to be very successful. I think the people who programmed
> xpdf's pdftotext really took the time to understand many massive documents about
> PDF files. I don't want to recreate the wheel. The only successful parser I have
> seen work on PDFs is pdf2text.

Well, if you're going to work with a PDF to PS converter, like acroread, or
xpdf's pdftops utility, then the best place for changes is right in
htdig/PDF.cc, rather than reinventing the wheel with an external parser.

On the other hand, if you're going to work with a PDF to text converter,
like xpdf's pdftotext, or Ghostscript's pdf2text (or is it pdf2ascii?), then
an external parser is probably the way to go. This is the approach used
for MS Word files (using catdoc) and PostScript files (using gs's ps2ascii).

> I looked at PDF.cc from 3.1.1 and it doesn't appear to have changed the way it
> parses the output from the acroread since a few versions back. I tried setting
> external_parsers to a tweaked version of htparsedoc which invokes xpdf's pdftotext,
> but pdftotext doesn't seem to be receiving the name of the file to parse as an
> argument. Therefore, the parser outputs useless info, and the data about the pdf
> doesn't enter the database at all. How can this be fixed?

Sounds like you weren't passing the right arguments to the parser. Try the
patch to parse_doc.pl that I posted a little earlier today. I haven't tested
it thoroughly, but it seems to produce the required output.
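
To illustrate the argument passing, here is a minimal sketch of an external parser wrapper in Python. It assumes that htdig calls the parser with the path of the file to parse as the first argument (followed by content type, URL and config file), and that it accepts tab-separated "w" records of word, location (0 to 1000) and heading level. Both assumptions should be checked against the external_parsers documentation. The names build_command and word_records are hypothetical, and none of this is what parse_doc.pl actually does.

```python
import re
import subprocess
import sys


def build_command(argv):
    """Return the pdftotext command line for the file named in argv.

    Assumption: argv[1] is the path of the document to parse. Any further
    arguments (content type, URL, config file) are ignored in this sketch.
    """
    if len(argv) < 2:
        raise ValueError("expected the file to parse as the first argument")
    # "-" asks pdftotext to write the extracted text to standard output.
    return ["pdftotext", argv[1], "-"]


def word_records(text):
    """Turn plain text into one tab-separated word record per word.

    Assumption: each record is "w<TAB>word<TAB>location<TAB>heading", with
    the location scaled to 0..1000 by the word's position in the text.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    total = max(len(words), 1)
    return [f"w\t{w.lower()}\t{i * 1000 // total}\t0"
            for i, w in enumerate(words)]


# Only run the converter when a file name was actually supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    result = subprocess.run(build_command(sys.argv),
                            capture_output=True, text=True, check=True)
    print("\n".join(word_records(result.stdout)))
```

If the converter never sees a file name, the first thing to check is whether the wrapper really passes its first argument through, as build_command does here.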

In another message, Patrick said:
> Has anyone had any problems with xpdf's pdftotext (with decryption patch)? Maybe
> the PDF.cc could solely rely on pdftotext instead of acroread and its internal
> parsing? I have tested pdftotext with many pdf's and it seems to work so far on
> all the ones PDF.cc failed on.

As I reported earlier, I still have a problem with it. If you try it on
this document:

        http://www.scrc.umanitoba.ca/SCRC/profile/profile_rob_98.pdf

you'll see what I mean. pdftotext spits out concatenated words from this
document.

> According to the xpdf README, many documents from Adobe were consulted when
> pdftotext was written. I think that the value of making PDF.cc use pdftotext would
> represent a significant improvement.

Yes, I have no trouble believing that they did their homework. It looks
like very clean code, from what I've seen of it. Unfortunately, that
doesn't prevent the software that generates the PDFs from doing
something stupid. I think that's the problem I ran into, and it's
probably Corel DRAW's fault. However, my recent changes to PDF.cc,
to handle the large Tc character spacing that Corel uses for separating
words, fix the problem for me. The concatenation problem that you have
seems to have a different cause, though. PDF.cc currently ignores
the numbers in the array given to the TJ command, which it shouldn't.
I'll have to see how xpdf deals with them, and hopefully do something
sensible in PDF.cc as well.
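
As a rough illustration of what handling those numbers could look like, here is a small Python sketch (not the PDF.cc code, and not necessarily what xpdf does). In a TJ array the numbers are position adjustments in thousandths of a text-space unit, subtracted from the current position, so a large negative value opens a gap before the next string. The sketch inserts a space whenever the adjustment is at or below a threshold. The threshold of -200 and the name tj_to_text are assumptions made for illustration only.

```python
import re

# Hypothetical threshold: an adjustment at or below this value (thousandths
# of a text-space unit) is treated as a word gap rather than kerning.
WORD_GAP = -200.0

# A TJ array element is either a (string) or a number.
# Simplification: nested parentheses and octal escapes are not handled.
TOKEN = re.compile(r"\((?P<text>(?:\\.|[^\\()])*)\)|(?P<num>[-+]?\d*\.?\d+)")


def tj_to_text(array_body, word_gap=WORD_GAP):
    """Join the strings of a TJ array, adding a space at large gaps."""
    out = []
    for m in TOKEN.finditer(array_body):
        if m.group("text") is not None:
            # Drop the backslash from simple escapes such as \( and \).
            out.append(re.sub(r"\\(.)", r"\1", m.group("text")))
        elif (float(m.group("num")) <= word_gap
              and out and not out[-1].endswith(" ")):
            out.append(" ")
    return "".join(out)


print(tj_to_text("(Spinal) -300 (Cor) 15 (d) -278 (Research)"))
```

Ignoring the numbers altogether would give "SpinalCordResearch" for this array, which is the kind of concatenation described above. The right threshold probably depends on the font and its size, so a fixed value is only a starting point.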

> Has anyone tried to tweak and test PDF.cc so that it relies solely on pdftotext?
> If not, I will and let the list know if there is any significant improvement.

As I said above, PDF to text converters are probably best used with an
external parser. This can be done right now, without changing a line
of code in htdig. I'd rather we didn't toss out the bulk of Sylvain's
work on PDF.cc just yet - for the most part it works now, and with a
bit of tweaking it'll work better!

> Does anyone know what is the best pdf to text parser out there? How about the best
> ps to text parser?

We seem to have a difference of opinion here. Kevin Quinn says that
the latest Ghostscript rocks, and that he's using its pdf2ps program
as a PDF parser for htdig without any problems. On the other hand,
Geoff Hutchison tried it on Rick Wiggins's test.pdf file, and it didn't
produce any usable text strings in its PostScript output. Could you
guys do a bit more testing with a few other PDF documents to see what
works and what doesn't?

As for PDF to text, I really don't know. I'd appreciate it if someone with
the latest Ghostscript tried its PDF to text on my profile_rob_98.pdf
document above, to see if it concatenates words like xpdf's pdftotext
does.
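
For anyone comparing converters on several documents, a crude check like the following Python sketch might help flag output with concatenated words. It only measures the fraction of whitespace-separated tokens longer than a cutoff. The cutoff of 20 characters and the name long_token_ratio are assumptions, and a high ratio is a hint to look at the text by eye, not proof of a problem.

```python
def long_token_ratio(text, max_len=20):
    """Return the fraction of whitespace-separated tokens longer than max_len."""
    tokens = text.split()
    if not tokens:
        return 0.0
    # Each comparison yields True or False, which sum() counts as 1 or 0.
    return sum(len(t) > max_len for t in tokens) / len(tokens)


# Compare the same made-up sentence with and without word breaks.
good = "Spinal Cord Research Centre profile"
bad = "SpinalCordResearchCentre profile"
print(long_token_ratio(good), long_token_ratio(bad))
```

Running it over the text output of each converter for the same PDF gives a quick way to compare them, though URLs and long identifiers will also count as long tokens.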

I don't know of any PS to text parser other than Ghostscript's ps2ascii
utility. parse_doc.pl supports it now, and it seems to work, though
I haven't really put it through its paces. It wouldn't surprise me at
all if some PostScript files don't convert well to text, with this or
any other utility. I've seen some pretty ugly PS code in my day.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930