Re: [htdig] PDF parsing in htdig/PDF.cc


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 24 Feb 1999 17:22:24 -0600 (CST)


According to Patrick Dugal:
> I see the problem now. I think the people that are programming pdftotext would love to
> solve this concatenation problem. Someone just has to send them a link to your pdf. Do
> you want to look into it?

Good suggestion. Yes, I'll e-mail Derek Noonburg about this. Thanks.

> The output from GhostScript's ps2ascii on profile_rob_98.pdf is enclosed. I used
> acroread to convert to PS and then ps2ascii to convert to ascii. The output is better
> than xpdf's pdftotext but it's not perfect, check it out for yourself. The N's are
> misplaced.

OK. That's about as good as I ever got with my fixes to PDF.cc. I had
given up on the misplaced Ns, and the occasional word break within a
word in some documents like this one. As long as most of the text is
indexed properly, I'm happy. These Corel DRAW files are just too weird.
The misplaced Ns happen because Corel outputs all the large caps first,
before the rest of the text. Sheesh!
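
To illustrate the ordering problem, here is a rough Python sketch (not
what PDF.cc or pdftotext actually do) of putting text fragments back in
reading order by position rather than by the order they appear in the
file. The (x, y, text) fragment tuples, the reorder_fragments name and
the line_tolerance value are all assumptions made up for this example,
and it assumes a large cap shares a baseline with the rest of its line,
which may not hold for real Corel DRAW output.

```python
from typing import List, Tuple

# Hypothetical fragment record: (x, y, text), with y growing upward as in PDF.
Fragment = Tuple[float, float, str]


def reorder_fragments(fragments: List[Fragment],
                      line_tolerance: float = 2.0) -> List[str]:
    """Group fragments into lines by baseline, then order each line by x."""
    # Visit fragments from the top of the page to the bottom.
    ordered = sorted(fragments, key=lambda f: -f[1])

    lines: List[List[Fragment]] = []
    for frag in ordered:
        # Same line if the baseline is close to that line's first fragment.
        if lines and abs(lines[-1][0][1] - frag[1]) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    # Within each line, read left to right.
    return ["".join(text for _, _, text in sorted(line, key=lambda f: f[0]))
            for line in lines]


# Large caps emitted first, the rest of the text afterwards.
stream = [
    (10.0, 700.0, "N"),
    (10.0, 680.0, "N"),
    (18.0, 700.0, "ow is the time"),
    (18.0, 680.0, "ever mind"),
]
print(reorder_fragments(stream))  # ['Now is the time', 'Never mind']
```

A real converter would also have to cope with caps that are taller than
the line or sit on a different baseline, so treat this only as a picture
of the idea.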

I actually got pretty much the same result as you by running the PDF
file through xpdf's pdftops, then the PS through ps2ascii. That may be
an option too!
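
For anyone who wants to try that two-step route, a minimal Python sketch
follows. It assumes pdftops and ps2ascii are both on the PATH and accept
an input file followed by an output file; the conversion_commands and
pdf_to_text helper names and the choice of output file names are made up
for this example.

```python
import subprocess
from pathlib import Path
from typing import List


def conversion_commands(pdf_path: str) -> List[List[str]]:
    """Build the two command lines: PDF -> PS (pdftops), PS -> text (ps2ascii)."""
    pdf = Path(pdf_path)
    ps = pdf.with_suffix(".ps")    # intermediate PostScript file
    txt = pdf.with_suffix(".txt")  # final plain-text output
    return [
        ["pdftops", str(pdf), str(ps)],
        ["ps2ascii", str(ps), str(txt)],
    ]


def pdf_to_text(pdf_path: str) -> Path:
    """Run both steps in order; raises CalledProcessError if either fails."""
    for cmd in conversion_commands(pdf_path):
        subprocess.run(cmd, check=True)
    return Path(pdf_path).with_suffix(".txt")
```

Calling pdf_to_text("profile_rob_98.pdf") would leave the text in
profile_rob_98.txt next to the original, provided both tools are
installed.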

However, what I'd really like to see is the output of the latest
Ghostscript's PDF to text converter, as well as its PDF to PS converter.
Could you send me these, please? (No need to post to the list, though.)
I know I should probably just upgrade to the latest version myself, but
I've just got too many other things in the queue right now.

I have its PS to text converter, which already works reasonably well
with the old 3.33 version I have.

> It seems as if there doesn't exist a good parser for either PostScript or PDF. This parsing
> business is more complicated than I originally thought. I don't know where to go
> from here. None of the PDF parsers I've tried (ht://Dig's PDF.cc, GS's ps2ascii, xpdf's
> pdftotext, my own Perl attempt) seem to function consistently.
>
> I think PDF and PS parser programmers are going to be banging their heads on the keyboard
> if and when Adobe changes their PDF standards.

Let's just hope Adobe sticks to their standard. Trouble is, these PDF
files can come from a number of different sources, and some of these
sources (like Corel DRAW) do some pretty ugly stuff. PostScript files
are even worse, so parsing these documents is not straightforward.
I just hope we can get something that works reasonably well in most cases.

> Does anyone know if Adobe will produce a good pdf to text parser as part of acroread?
> Should I contact the people at Adobe?

Can't hurt! However, if they do come up with something, you can be
pretty sure it will be a binary-only release, and not open source.
That leaves some htdig users out in the cold, so if we can also work
out some open source alternatives, then everyone's happy.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930