Re: [htdig] PDF and PostScript Parsing


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Tue, 23 Feb 1999 13:45:18 -0600 (CST)


I've followed up privately with Patrick, but for the benefit of others
on the list, I'll give my most recent findings here.

According to me:
> According to Patrick Dugal:
> > As you can tell from this snippet, the internal parsing of
> > the acroread output is not quite what you'd expect. The
> > strings get concatenated somehow and so the data becomes
> > nearly useless and impossible to search.
>
> Looks like the same problem I had with some PDF files. I fixed it in
> 3.1.1, so do give that a try. It may very well fix the problem with your
> files too. My files were generated by Adobe Acrobat PDF Writer, from
> Corel DRAW files, but the same effect may occur with other file types too.
> The problem is that sometimes the inter-word spacing is generated by
> cranking up the character spacing, rather than actually using a space
> character, or a motion command. The latest version of PDF.cc does try
> to deal with this, and I'd appreciate further testing by others to make
> sure my assumptions about the spacing threshold are correct.
>
> > Is there a way I can configure htdig to disable the internal
> > parsing of the acroread output? I'd like to use the
> > pdftotext program included in the xpdf software to do the
> > whole conversion from PDF to text and have htdig receive
> > this file internally in the indexing process. How would I
> > go about doing that without changing the source code?
> >
> > Any suggestions would be very helpful.
>
> Yup, you can define an external parser, and it should override the
> internal one. You could use the parse_doc.pl perl script (included
> in 3.1.1's contrib directory) as a starting point. Add to it a bit
> of code to recognise the PDF file magic string ("%PDF-" should do it),
> and call pdftotext to parse the PDF file into text.
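
For anyone who wants to try that route, here is a minimal sketch of the
detection and conversion step. It is in Python rather than Perl, and it
is not taken from parse_doc.pl. The names looks_like_pdf, pdf_to_text
and parse_file are made up for illustration. It assumes pdftotext is on
your PATH and that giving "-" as the output file sends the text to
stdout. Turning that text into the records htdig expects from an
external parser is deliberately left out.

```python
import subprocess

# The file magic string suggested above.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path):
    """Return True if the file starts with the PDF magic string."""
    with open(path, "rb") as f:
        return f.read(len(PDF_MAGIC)) == PDF_MAGIC


def pdf_to_text(path):
    """Run pdftotext on the file and return the extracted text.

    Assumption: "pdftotext <file> -" writes the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True,
        check=True,
    )
    # The output encoding is not known here; decode permissively.
    return result.stdout.decode("latin-1", errors="replace")


def parse_file(path):
    """Return extracted text for PDF files, or None for anything else."""
    if not looks_like_pdf(path):
        return None
    return pdf_to_text(path)
```

A wrapper script would call parse_file() on the file name htdig hands
it and then emit whatever output format the external parser interface
requires.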

I don't know how well pdftotext will work as part of an external parser.
I just tried pdftotext myself on one of the documents that had given me
the concatenation problem in earlier versions of htdig. To solve this
concatenation problem, you need something that can handle the odd
character spacing in some PDF files, where the gaps between words come
from increased character spacing rather than from a space character.
That means your best bet is to use acroread as your pdf_parser, with
htdig 3.1.1, which has the fix.
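
To illustrate the kind of handling I mean, here is a rough sketch of a
spacing-threshold heuristic in Python. It is NOT the code in PDF.cc.
The function name join_glyphs, the (char, x_start, x_end) input format
and the 0.25 threshold are all assumptions chosen for the example. The
idea is only that a gap between two glyphs larger than some fraction of
the font size gets treated as a word break.

```python
def join_glyphs(glyphs, font_size, threshold=0.25):
    """Join glyphs into text, inserting a space at wide gaps.

    glyphs: list of (char, x_start, x_end) tuples in reading order
            (hypothetical input format).
    font_size: font size in the same units as the x coordinates.
    threshold: fraction of font_size above which a gap counts as a
               word break (an assumed value, not htdig's).
    """
    out = []
    prev_end = None
    for ch, x_start, x_end in glyphs:
        # A gap wider than the threshold is taken as inter-word spacing.
        if prev_end is not None and (x_start - prev_end) > threshold * font_size:
            out.append(" ")
        out.append(ch)
        prev_end = x_end
    return "".join(out)


# Made-up glyph positions: small gaps inside words, a wide gap between.
sample = [("a", 0.0, 5.0), ("b", 5.5, 10.0), ("c", 14.0, 19.0), ("d", 19.5, 24.0)]
print(join_glyphs(sample, font_size=10.0))
```

Without a step like this, the four glyphs above would come out as one
run-together string, which is the concatenation problem described here.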

> I'm also going to look into using the pdftops program, included with xpdf,
> as a PDF parser for use with the internal PDF.cc code. Earlier reports
> on this list claimed it worked, and that was the reason for moving
> the acroread-specific options into the pdf_parser attribute. However,
> Joe Jah just reported yesterday that it doesn't seem to work at all,
> and he claims to be using the latest version of xpdf. I'll let you know
> if I get that working.

I can confirm that pdftops from xpdf 0.80 won't work as a pdf_parser
with htdig. Its output still does NOT contain the BT and ET (begin-text
and end-text) markers, so PDF.cc just skims through the PostScript
output from pdftops without indexing anything.
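
If you want to check the output of your own converter before pointing
htdig at it, a quick test along these lines may help. This is only a
rough sketch in Python, and the function name count_text_markers is made
up. It counts BT and ET where they appear as standalone
whitespace-delimited tokens, which is an assumption about how they show
up in the output. It could also be fooled by those letters appearing
inside string data.

```python
import re

# BT or ET as a standalone token, with whitespace (or start/end of
# data) on both sides.
_MARKER = re.compile(rb"(?<!\S)(BT|ET)(?!\S)")


def count_text_markers(data):
    """Return (bt_count, et_count) for a bytes object of parser output."""
    counts = {b"BT": 0, b"ET": 0}
    for match in _MARKER.finditer(data):
        counts[match.group(1)] += 1
    return counts[b"BT"], counts[b"ET"]
```

A result of (0, 0) on a converter's output would match what I am seeing
with pdftops.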

Can those who claimed it did work please let us know what they modified
to get it to work?

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


