Re: [htdig] Best way to parse PDF?


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Tue, 15 Jun 1999 13:00:01 -0500 (CDT)


According to Geoff Hutchison:
> On Tue, 15 Jun 1999, Marian Steinbach wrote:
> > Is there a universal way to index PDF files?
>
> I'll give a fairly short answer; I'm sure others will correct me
> if I'm wrong.
>
> Yes and no.
>
> Some programs write PDF files as graphics. This, of course, defeats the
> whole purpose of the format, and it makes those files essentially
> impossible to index.
>
> For the vast majority of PDF files, you'll do very well setting an
> external parser to parse_doc.pl and using xpdf. There has been quite a bit
> of discussion on this point, and I expect a search for xpdf should turn up
> a bunch.

No universal way, but many of us have found that pdftotext (which comes
with xpdf 0.80) is the best tool for the job. Use it in conjunction with
parse_doc.pl, as described in

        http://www.htdig.org/FAQ.html#q4.9

You can get the script from

        http://www.htdig.org/files/contrib/parsers/

or from the contrib directory in the source for ht://Dig 3.1.2. The
contrib/parsers/ directory on the web site also includes a couple of
patches for xpdf 0.80: one to improve its handling of oddball spacing in
some PDFs (xpdf-0.80-deltax.patch), and one to add a -rawdump option to
pdftotext for indexing multi-column PDFs (xpdf-0.80-rawdump.patch).
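
To make the moving parts a bit more concrete, here is a rough sketch of
what an external parser built around pdftotext does. It is NOT
parse_doc.pl (that script is Perl and handles much more), just a minimal
Python illustration. Several things in it are assumptions rather than
anything stated above: that pdftotext is on the PATH and writes to
standard output when given "-" as the output file, that ht://Dig passes
the temporary file name as the first argument, and that the parser
answers with tab-separated "w" (word, location 0-1000, heading) and "h"
(excerpt) records. The function names and the 3-letter minimum word
length are made up for the example.

```python
import re
import subprocess
import sys

MIN_WORD_LEN = 3  # hypothetical cutoff, chosen only for this sketch


def text_to_records(text, excerpt_len=80):
    """Turn plain text into tab-separated records for the indexer.

    Assumed record layout (check it against your ht://Dig version):
      w <TAB> word <TAB> location <TAB> heading
      h <TAB> excerpt
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    total = len(words)
    records = []
    for i, word in enumerate(words):
        if len(word) < MIN_WORD_LEN:
            continue
        # Relative position of the word in the document, scaled to 0..999.
        location = (i * 1000) // total
        records.append(f"w\t{word.lower()}\t{location}\t0")
    # Collapse whitespace and keep the start of the text as the excerpt.
    excerpt = " ".join(text.split())[:excerpt_len]
    if excerpt:
        records.append(f"h\t{excerpt}")
    return records


def pdf_to_text(pdf_path, pdftotext="pdftotext"):
    """Run pdftotext on pdf_path; "-" is assumed to mean standard output."""
    result = subprocess.run(
        [pdftotext, pdf_path, "-"], capture_output=True, check=True
    )
    return result.stdout.decode("latin-1", errors="replace")


def main(argv):
    # argv[1] is assumed to be the file ht://Dig hands to the parser.
    for record in text_to_records(pdf_to_text(argv[1])):
        print(record)
    return 0


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

PDFs that were written as graphics will produce no text from pdftotext,
so a parser like this emits no records for them, which is the "no" half
of Geoff's answer.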

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930