[htdig] Using pdftotext to index PDF documents


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Thu, 25 Feb 1999 14:13:46 -0600 (CST)


OK, folks, I'm doing some major back-pedaling! I've decided to give up
on acroread and htdig/PDF.cc after all, and I've switched to pdftotext
in an external parser. Here are my reasons:

1) acroread isn't open source, xpdf is.
2) Parsing PDFs is not straightforward, nor is parsing acroread's PS
   output. Sylvain made a valiant attempt at it, but I think there are
   too many exceptions that don't fit the cases his code handles.
3) Derek Noonburg really did his homework when he developed xpdf and
   pdftotext. Patrick reported that it worked well with all his PDFs.
   I think if we want good, open-source support for PDFs, this is the
   way to go.
4) Derek also fixed my problem with pdftotext concatenating words in
   some of my PDFs. There are still a few quirks, where some words
   are concatenated, but it's MUCH better now. Also, pdftotext doesn't
   misplace the large caps like the various PostScript-based solutions
   did. So, with this latest fix, this is the package I want to use!

So, after reconsidering, I think htdig/PDF.cc probably ought to be
scrapped. (Sorry, Sylvain.) I don't know about integrating the xpdf
code right into htdig, but I think as an external parser this is the
package to use. I think Patrick was right that pdftotext does a better
job of extracting text from a PDF than any other tool around.
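To make the external-parser route concrete, here is a minimal, hypothetical Python sketch of wrapping pdftotext the way a parser script would; it assumes pdftotext is on your PATH and that it accepts "-" as the output file name (meaning "write the extracted text to stdout"). This is an illustration only, not parse_doc.pl itself.

```python
# Hypothetical sketch of driving pdftotext (from xpdf) the way an
# external parser script would. Assumes pdftotext is on PATH and
# accepts "-" as the output file, meaning "write text to stdout".
import subprocess

def pdftotext_command(pdf_path):
    """Build the argv for a pdftotext run; "-" sends text to stdout."""
    return ["pdftotext", pdf_path, "-"]

def pdf_to_text(pdf_path):
    """Run pdftotext on pdf_path and return the extracted plain text."""
    proc = subprocess.run(pdftotext_command(pdf_path),
                          capture_output=True, text=True, check=True)
    return proc.stdout
```

From there, the wrapper only has to turn the returned text into whatever records htdig expects from an external parser.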

There's still a bit more work to be done. Patrick mentioned that
pdftotext changed hyphens to spaces. Not so, but parse_doc.pl does.
In fact, it converts all punctuation to spaces, to separate out the words.
The problem is that, right now, that word list is also what it spits out
for the "h" record, so there's no punctuation at all in the excerpts!
I'm sure this would be fairly easy to fix, and I hope to get to it later
today. I want to make its text parsing similar to the parsing done by
htdig/Plaintext.cc. (Which raises the question: "why can't an external
parser just pass plain text or HTML to htdig for further parsing?")
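The fix I have in mind could look something like this hedged Python sketch: treat punctuation as a separator only when building the word list, and leave the original text intact for the excerpt. The function names and the 80-character excerpt limit are hypothetical; only the idea of separating the two outputs comes from the discussion above.

```python
# Hypothetical sketch of separating the indexing word list from the
# excerpt text, as discussed above. Punctuation is mapped to spaces
# only for the word list; the excerpt keeps its punctuation, roughly
# as htdig/Plaintext.cc would.
import re

def words_from(text):
    """Word list for indexing: punctuation treated as a separator."""
    return re.sub(r"[^\w\s]", " ", text).split()

def excerpt_from(text, limit=80):
    """Excerpt text: whitespace normalized, punctuation left intact."""
    return " ".join(text.split())[:limit]
```

With this split, hyphens and other punctuation would survive in the excerpts while the word list still breaks on them.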

Some users may also want to extract the titles from their PDFs, as
Sylvain's code did. parse_doc.pl doesn't do that right now, but with
a bit more coding, using the pdfinfo utility in xpdf, it would be an
easy addition. I haven't done it because my PDFs didn't have reasonable
titles anyway, so I'd just as soon use the file name.
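For anyone who does want titles, a sketch of the pdfinfo approach might look like this; pdfinfo (also part of xpdf) prints "Key: value" lines, one of which is "Title:". The function below is a hypothetical illustration that parses such output and falls back to the file name, as suggested above; actually invoking pdfinfo is left to the caller.

```python
# Hypothetical sketch: pull the Title: field out of pdfinfo's
# "Key: value" output, falling back to the PDF's file name when no
# usable title is present.
import os

def title_from_pdfinfo(info_output, pdf_path):
    """Return the Title: value from pdfinfo output, or the file name."""
    for line in info_output.splitlines():
        if line.startswith("Title:"):
            title = line[len("Title:"):].strip()
            if title:
                return title
    return os.path.basename(pdf_path)
```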

Anyway, here's Derek's fix for my concatenation problem:

--- xpdf/TextOutputDev.cc.deltax Fri Nov 27 21:42:16 1998
+++ xpdf/TextOutputDev.cc Thu Feb 25 09:55:28 1999
@@ -217,6 +217,7 @@ void TextPage::addChar(GfxState *state,
   double x1, y1, w1, h1;
 
   state->transform(x, y, &x1, &y1);
+  dx -= state->getCharSpace();
   state->transformDelta(dx, dy, &w1, &h1);
   curStr->addChar(state, x1, y1, w1, h1, c, useASCII7);
 }

And to bring you up to speed, here is my dialogue with Derek:

> From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
> Subject: Re: bug in pdftotext in xpdf 0.80
> To: derekn@foolabs.com (Derek B. Noonburg)
> Date: Thu, 25 Feb 1999 10:04:23 -0600 (CST)
>
> Hi again, Derek. Thanks for the prompt response, and the bug fix too!
>
> According to Derek B. Noonburg:
> > > I have some strange PDF files, though, which come from Corel DRAW documents,
> > > and these seem to confuse pdftotext. For example, if you try it out on:
> > >
> > > http://www.scrc.umanitoba.ca/SCRC/profile/profile_rob_98.pdf
> > >
> > > You'll see that most of the words are concatenated. However, when I
> > > view it in xpdf, it looks fine, and when I pass it through pdftops, and
> > > pass the PS file through ps2ascii (from gs 3.33), it also comes out OK.
> > > I'd appreciate it if you can solve this little problem. The file seems
> > > to crank the character spacing way up with a Tc command, and uses this
> > > as a word spacing, rather than using actual space characters or motion
> > > commands.
> >
> > You're right about the cause of the problem. Pdftotext was using the
> > "delta-x" for the character (width + char spacing) instead of just the
> > width.
> >
> > The fix is simple, if you don't mind recompiling. In
> > xpdf/TextOutputDev.cc, insert a line in TextPage::addChar():
> >
> > void TextPage::addChar(GfxState *state, double x, double y,
> >                        double dx, double dy, Guchar c) {
> >   double x1, y1, w1, h1;
> >
> >   state->transform(x, y, &x1, &y1);
> >   dx -= state->getCharSpace();   // insert this line
> >   state->transformDelta(dx, dy, &w1, &h1);
> >   curStr->addChar(state, x1, y1, w1, h1, c, useASCII7);
> > }
>
> I don't mind recompiling at all. I'll post a patch to the ht://Dig mailing
> list, as we've been discussing using this tool as a PDF parser for indexing
> PDF documents on a web site. Right now, htdig uses acroread to spit out
> PS, and does some rudimentary parsing on the PS output. It sort of works,
> but there have been problems with it. Also, acroread isn't open source, but
> your tools are, so a lot of users are very interested in switching over.
>
> > > Don't worry about the misplaced Ns -- these happen because Corel DRAW
> > > outputs the large caps before the rest of the text.
> >
> > I just tried pdftotext, and these aren't misplaced... I'm not sure what
> > you mean.
> >
> > Thanks for the bug report.
>
> Thanks for the bug fix! You're right, the Ns aren't misplaced at all. I
> was confusing this tool with the "pdftops ... | ps2ascii" pipeline, which
> did misplace the Ns. All the more reason to use pdftotext for indexing!
>
> Thanks again for the help.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



This archive was generated by hypermail 2.0b3 on Fri Feb 26 1999 - 14:34:13 PST