Gilles Detillieux (email@example.com)
Tue, 15 Jun 1999 13:00:01 -0500 (CDT)
According to Geoff Hutchison:
> On Tue, 15 Jun 1999, Marian Steinbach wrote:
> > Is there a universal way to achieve indexing PDF?
> I'll give a fairly short answer; I'm sure others will correct me
> if I'm wrong.
> Yes and no.
> Some programs write PDF files as graphics. This, of course, defeats the
> whole purpose of the format, but it makes it essentially impossible to
> For the vast majority of PDF files, you'll do very well setting an
> external parser to parse_doc.pl and using xpdf. There has been quite a bit
> of discussion on this point, and I expect a search for xpdf should turn up
> a bunch.
No universal way, but many of us have found that pdftotext (which comes
with xpdf 0.80) is the best tool for the job. Use it in conjunction with
parse_doc.pl, as described in
You can get the script from
or from the contrib directory in the source for ht://Dig 3.1.2. The
contrib/parsers/ directory on the web site also includes a couple of
patches for xpdf 0.80: one to improve its handling of oddball spacing in
some PDFs (xpdf-0.80-deltax.patch), and one to add a -rawdump option to
pdftotext for indexing multi-column PDFs (xpdf-0.80-rawdump.patch).
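
To give a rough idea of the conversion step that parse_doc.pl hands off to pdftotext, here is a minimal Python sketch. It is not the actual script. The function names are made up for illustration, it assumes pdftotext is on your PATH, and it assumes the text output can be read as UTF-8 (older builds may emit Latin-1 instead). The -rawdump option only exists if you have applied the patch mentioned above.

```python
import os
import subprocess
import tempfile


def build_pdftotext_command(pdf_path, txt_path, extra_args=()):
    """Return the argv list for one pdftotext run.

    Options come first, then the input PDF, then the output text file.
    (Hypothetical helper, not part of ht://Dig or xpdf.)
    """
    return ["pdftotext", *extra_args, pdf_path, txt_path]


def pdf_to_words(pdf_path, extra_args=()):
    """Convert a PDF to plain text and return its whitespace-separated words.

    Assumes a pdftotext binary is installed and on PATH.
    """
    # pdftotext writes to a named output file, so use a temporary one.
    fd, txt_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        command = build_pdftotext_command(pdf_path, txt_path, extra_args)
        subprocess.run(command, check=True)  # raises if pdftotext fails
        # Assumption: output is UTF-8; undecodable bytes are replaced.
        with open(txt_path, encoding="utf-8", errors="replace") as handle:
            return handle.read().split()
    finally:
        os.remove(txt_path)
```

With a patched pdftotext you would call something like pdf_to_words("manual.pdf", ["-rawdump"]) for a multi-column document. A PDF that stores its pages as images will simply come back with no words, which is the case Geoff describes above.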
-- 
Gilles R. Detillieux              E-mail: <firstname.lastname@example.org>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930