Re: [htdig] PDF parsing in htdig/PDF.cc


Patrick Dugal (patrick.dugal@nrc.ca)
Wed, 24 Feb 1999 17:23:22 -0500


The output from Ghostscript's ps2ascii on profile_rob_98.pdf is enclosed below. I used
acroread to convert to PS and then ps2ascii to convert to ASCII. The output is better
than xpdf's pdftotext, but it's not perfect; check it out for yourself. The N's are
misplaced.

It seems there isn't a good parser for either PostScript or PDF. This parsing
business is more complicated than I originally thought, and I don't know where to go
from here. None of the PDF parsers I've tried (ht://Dig's PDF.cc, GS's ps2ascii, xpdf's
pdftotext, my own Perl attempt) seems to work consistently.

I think PDF and PS parser programmers are going to be banging their heads on their
keyboards if and when Adobe changes the PDF standard.

Does anyone know if Adobe will produce a good pdf to text parser as part of acroread?
Should I contact the people at Adobe?

Pat :)

Gilles Detillieux wrote:

> According to Patrick Dugal:
> > I have tried writing my own pdf2text Perl program which parses the output from
> > acroread, but I later found out that unless I was prepared to read a lot about
> > PDF files, I wasn't going to be very successful. I think the people who programmed
> > xpdf's pdftotext really took the time to understand many massive documents about
> > PDF files. I don't want to reinvent the wheel. The only successful parser I have
> > seen work on PDFs is pdf2text.
>
> Well, if you're going to work with a PDF to PS converter, like acroread, or
> xpdf's pdftops utility, then the best place for changes is right in
> htdig/PDF.cc, rather than reinventing the wheel with an external parser.
>
> On the other hand, if you're going to work with a PDF to text converter,
> like xpdf's pdftotext, or Ghostscript's pdf2text (or is it pdf2ascii?), then
> an external parser is probably the way to go. This is the approach used
> for MS Word files (using catdoc) and PostScript files (using gs's ps2ascii).
>
> > I looked at PDF.cc from 3.1.1 and it doesn't appear to have changed the way it
> > parses the output from acroread since a few versions back. I tried setting
> > external_parsers to a tweaked version of htparsedoc which invokes xpdf's pdftotext,
> > but pdftotext doesn't seem to be receiving the name of the file to parse as an
> > argument. Therefore, the parser outputs useless info, and the data about the PDF
> > doesn't enter the database at all. How can this be fixed?
>
> Sounds like you weren't passing the right arguments to the parser. Try the
> patch to parse_doc.pl that I posted a little earlier today. I haven't tested
> it thoroughly, but it seems to produce the required output.
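
A stripped-down sketch of an external parser along those lines, for reference. It
assumes the calling convention parse_doc.pl appears to use (temp file with the
document contents, content-type, URL, configuration file) and the tab-separated
"w"/"h" output records it emits; both are worth double-checking against parse_doc.pl
itself, and the pdf_parse.pl name is only a placeholder.

#!/usr/local/bin/perl
# pdf_parse.pl -- hypothetical external parser for application/pdf.
# Wire it up in htdig.conf with something like:
#   external_parsers: application/pdf /usr/local/bin/pdf_parse.pl
use strict;

# Assumed argument order (check parse_doc.pl): temp file holding the
# document, content-type, URL, configuration file.
my ($file, $type, $url, $config) = @ARGV;
die "usage: $0 file content-type url config\n" unless defined $file;

# Run xpdf's pdftotext on the temp file; "-" sends the text to stdout.
open(TEXT, "pdftotext '$file' - |") or die "can't run pdftotext: $!\n";
my @lines = grep { /\S/ } <TEXT>;
close(TEXT);

# "h" records: text used for the excerpt shown in search results.
foreach my $line (@lines) {
    my $copy = $line;
    chomp $copy;
    print "h\t$copy\n";
}

# "w" records: word, location scaled 0-1000, heading level (0 = plain text).
my @words = map { split /\W+/ } @lines;
my $total = @words || 1;
my $i = 0;
foreach my $word (@words) {
    printf("w\t%s\t%d\t0\n", $word, int(1000 * $i / $total))
        if length($word) >= 3;
    $i++;
}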
>
> In another message, Patrick said:
> > Has anyone had any problems with xpdf's pdftotext (with the decryption patch)? Maybe
> > PDF.cc could rely solely on pdftotext instead of acroread and its internal
> > parsing? I have tested pdftotext with many PDFs, and so far it seems to work on
> > all the ones PDF.cc failed on.
>
> As I reported earlier, I still have a problem with it. If you try it on
> this document:
>
> http://www.scrc.umanitoba.ca/SCRC/profile/profile_rob_98.pdf
>
> you'll see what I mean. pdftotext spits out concatenated words from this
> document.
>
> > According to the xpdf README, many documents from Adobe were consulted when
> > pdftotext was written. I think that making PDF.cc use pdftotext would be a
> > significant improvement.
>
> Yes, I have no trouble believing that they did their homework. It looks
> like very clean code, from what I've seen of it. Unfortunately,
> it doesn't prevent the software that generates the PDFs from doing
> something stupid. I think that's the problem I ran into, and it's
> probably Corel DRAW's fault. However, my recent changes to PDF.cc,
> to handle the large Tc character spacing that Corel uses for separating
> words, fix the problem for me. Unfortunately, the concatenation problem
> that you have seems to have a different cause. PDF.cc currently ignores
> the numbers in the array given to the TJ command, which it shouldn't.
> I'll have to see how xpdf deals with them, and hopefully do something
> sensible in PDF.cc as well.
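
For reference, the numbers in a TJ array are kerning adjustments in thousandths of a
text-space unit, subtracted from the glyph position, so a large negative value pushes
the next string fragment far enough to the right to act as a word gap. A rough Perl
illustration of that test (not PDF.cc's or xpdf's actual code; the -100 cutoff is
only a guess):

use strict;

# Turn the contents of a "[...] TJ" array into text, inserting a space
# wherever the kerning adjustment is large enough to look like a word gap.
sub tj_to_text {
    my ($tj) = @_;                    # e.g. '(Hel) -35 (lo) -250 (world)'
    my $gap_threshold = -100;         # guessed cutoff, in 1/1000 text units
    my $text = '';
    # Tokens are either (...) string fragments or bare numbers.
    foreach my $token ($tj =~ /\((?:\\.|[^()\\])*\)|-?\d+(?:\.\d+)?/g) {
        if ($token =~ /^\(/) {
            my $frag = substr($token, 1, -1);   # strip the parentheses
            $frag =~ s/\\([()\\])/$1/g;         # undo \( \) \\ escapes only
            $text .= $frag;
        } elsif ($token < $gap_threshold) {
            $text .= ' ';                       # big negative gap = space
        }
    }
    return $text;
}

print tj_to_text('(Hel) -35 (lo) -250 (world)'), "\n";   # prints "Hello world"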
>
> > Has anyone tried to tweak and test PDF.cc so that it relies solely on pdftotext?
> > If not, I will, and I'll let the list know if there is any significant improvement.
>
> As I said above, PDF to text converters are probably best used with an
> external parser. This can be done right now, without changing a line
> of code in htdig. I'd rather we didn't toss out the bulk of Sylvain's
> work on PDF.cc just yet - for the most part it works now, and with a
> bit of tweaking it'll work better!
>
> > Does anyone know what the best PDF-to-text parser out there is? How about the
> > best PS-to-text parser?
>
> We seem to have a difference of opinion here. Kevin Quinn says that
> the latest Ghostscript rocks, and that he's using its pdftops program
> as a PDF parser for htdig without any problems. On the other hand,
> Geoff Hutchison tried it on Rick Wiggins's test.pdf file, and it didn't
> produce any usable text strings in its PostScript output. Could you
> guys do a bit more testing with a few other PDF documents to see what
> works and what doesn't?
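
A throwaway harness for that kind of comparison, assuming only a converter that
writes its text to stdout (as xpdf's pdftotext does when given "-" as the output
file); the checkpdfs.pl name and the crude "suspicious output" checks are just
illustrative:

#!/usr/local/bin/perl -w
# checkpdfs.pl -- run a PDF-to-text command over every .pdf in a directory
# and flag the ones that come back empty or with suspiciously long unbroken
# "words" (the concatenation symptom seen with profile_rob_98.pdf).
# Usage:  checkpdfs.pl /some/dir "pdftotext %s -"
# where %s is replaced by each PDF's filename.
use strict;

my ($dir, $template) = @ARGV;
$dir      ||= '.';
$template ||= "pdftotext %s -";      # default: xpdf's pdftotext to stdout

opendir(DIR, $dir) or die "can't read $dir: $!\n";
my @pdfs = sort grep { /\.pdf$/i } readdir(DIR);
closedir(DIR);

foreach my $pdf (@pdfs) {
    my $command = sprintf($template, "'$dir/$pdf'");
    my $text = `$command 2>/dev/null`;
    my $verdict = 'looks ok';
    $verdict = 'CONCATENATED WORDS?' if $text =~ /[A-Za-z]{40,}/;
    $verdict = 'NO TEXT AT ALL'      if $text !~ /\w/;
    printf "%-30s %s\n", $pdf, $verdict;
}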
>
> As for PDF to text, I really don't know. I'd appreciate it if someone with
> the latest Ghostscript tried its PDF-to-text conversion on my profile_rob_98.pdf
> document above, to see if it concatenates words like xpdf's pdftotext
> does.
>
> I don't know of any PS to text parser other than Ghostscript's ps2ascii
> utility. parse_doc.pl supports it now, and it seems to work, though
> I haven't really put it through its paces. It wouldn't surprise me at
> all if some PostScript files don't convert well to text, with this or
> any other utility. I've seen some pretty ugly PS code in my days.
>
> --
> Gilles R. Detillieux E-mail: <grdetil@scrc.umanitoba.ca>
> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930

N

N

N

ew concept: The output from the nervous system to muscles used in walking is controlled by regulation of intrinsic properties of spinal motoneurons from descending pathways and from the central pattern generator for locomotion. These findings have placed Dr. Brownstone at the forefront of workers in the field of motoneuron control, and provide the basis for development of novel strategies for recovery of function after injury.

ew technology: An preparation of the isolated mouse spinal cord has been developed in Dr. Brownstone's laboratory. This preparation enables researchers of the Spinal Cord Research Centre to take advantage of transgenic mouse technology for the study of models of disease and injury, and the genetic factors which can be manipulated to promote functional recovery.

eurotransplantation: Immature cells can be transplanted into a severed nerve, survive, mature into motonuerones, and effect muscle contraction. This finding has implications for the treatment of spinal cord injury and ALS.

Identify the inputs to motoneurons responsible for control of motoneuron intrinsic properties.

Determine the mechanism of action of immunoglobulins from patients with ALS (which transfer ALS to mice) on motoneurons in the isolated mouse spinal cord, with the long range goal of producing new treatments for ALS and other motoneuron diseases.

Develop methodology for use of implanted stimulators and intrathecal drug injections in paraplegic patients for eliciting locomotor movements (with B. Schmidt and P. Nance).

Spinal Cord Research Centre: Next Steps

in vitro

*

*

*

Robert Brownstone, Ph.D, M.D. Assistant Professor, Surgery Adjunct Professor, Department of Physiology
