Re: [htdig] Best way to parse PDF?

Gilles Detillieux
Tue, 15 Jun 1999 13:00:01 -0500 (CDT)

According to Geoff Hutchison:
> On Tue, 15 Jun 1999, Marian Steinbach wrote:
> > Is there a universal way to achieve indexing PDF?
> I'll give a fairly short answer, I'm sure others will probably correct me
> if I'm wrong.
> Yes and no.
> Some programs write PDF files as graphics. This, of course, defeats the
> whole purpose of the format, but it makes it essentially impossible to
> index.
> For the vast majority of PDF files, you'll do very well setting an
> external parser to and using xpdf. There has been quite a bit
> of discussion on this point, and I expect a search for xpdf should turn up
> a bunch.

No universal way, but many of us have found that pdftotext (which comes
with xpdf 0.80) is the best tool for the job. Use it in conjunction with, as described in

You can get the script from

or from the contrib directory in the source for ht://Dig 3.1.2. The
contrib/parsers/ directory on the web site also includes a couple of patches
for xpdf 0.80, to improve its handling of oddball spacing in some PDFs
(xpdf-0.80-deltax.patch), and to add a -rawdump option to pdftotext for
indexing multi-column PDFs (xpdf-0.80-rawdump.patch).
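
For anyone wiring this up by hand, here is a minimal sketch (in Python, and
not part of ht://Dig or its contrib scripts) of the step an external parser
has to perform: run pdftotext on a PDF and check whether any text came out.
The function names are made up for illustration. It assumes pdftotext is on
the PATH, is invoked as "pdftotext [options] input.pdf output.txt", and that
its output can be decoded as UTF-8 (older builds may emit Latin-1 instead).
The -rawdump option only exists if the patch mentioned above has been applied.

```python
import re
import shutil
import subprocess
import tempfile
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, extra_args=()):
    """Build the argument list: pdftotext [options] PDF-file text-file."""
    return ["pdftotext", *extra_args, str(pdf_path), str(txt_path)]


def extract_text(pdf_path, extra_args=()):
    """Run pdftotext on one PDF and return whatever text it produced."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        txt_path = Path(tmp) / "out.txt"
        # check=True raises CalledProcessError if pdftotext exits non-zero.
        subprocess.run(pdftotext_command(pdf_path, txt_path, extra_args),
                       check=True)
        # Assumption: UTF-8 output; undecodable bytes are replaced, not fatal.
        return txt_path.read_bytes().decode("utf-8", errors="replace")


def looks_indexable(text, min_words=1):
    """Image-only PDFs yield little or no text, so count the words found."""
    return len(re.findall(r"[A-Za-z0-9]+", text)) >= min_words
```

Calling extract_text("report.pdf") and then looks_indexable() on the result
gives a quick way to spot the graphics-only PDFs Geoff mentioned, since those
come back empty or nearly so. On a patched build, passing
extra_args=["-rawdump"] would request the raw dump for multi-column files.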

Gilles R. Detillieux              E-mail: <>
Spinal Cord Research Centre       WWW:
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
