Re: [htdig] Using pdftotext to index PDF documents


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Thu, 4 Mar 1999 14:17:17 -0600 (CST)


According to Patrick Dugal:
> Does anybody have any idea if and when ht://Dig will start using xpdf's pdftotext?

I don't think anyone has any plans to integrate pdftotext right into
ht://Dig. It may be usable as part of the proposed external_decoders
enhancement, if/when that gets done. Right now, the only option is
to use it as part of an external parser, as I explained last week.

> I would really like to be able to index the PDF documents with a more reliable parser
> with ht://Dig soon. I don't want to throw Sylvain Wallez's work out the window,
> but that's probably what it's going to take for a significant improvement.

I agree that that's what it takes. As I said last week, I've reconsidered
my stand on this. Derek has done his homework, and has produced a
reliable parser for PDFs. Sylvain's code is OK as an initial stab,
but would require major rewriting to work as well as Derek's (or even
to work with Derek's pdftops). But why reinvent the wheel? pdftotext
works great as it is, especially with Derek's latest fix for it.

However, you don't need to physically take Sylvain's code out of htdig to
use pdftotext in its place. The way htdig is designed, you can define
an external parser for any MIME type, and it will override the internal
parser for that type, if an internal parser exists.
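
As an illustration of that lookup order, here is a minimal sketch in C++.
It is not taken from the htdig source: the ParserRegistry type, its
members, and the "external:"/"internal:" prefixes are hypothetical names
used only to show that a configured external parser is checked first.

```cpp
#include <map>
#include <string>

// Hypothetical registry for illustration; none of these names come from
// the htdig source.
struct ParserRegistry {
    std::map<std::string, std::string> external;  // MIME type -> command
    std::map<std::string, std::string> internal;  // MIME type -> built-in parser

    // Pick the parser for a MIME type. An external parser, if one is
    // configured for the type, wins over any internal parser.
    std::string choose(const std::string &mimeType) const {
        auto ext = external.find(mimeType);
        if (ext != external.end())
            return "external:" + ext->second;
        auto in = internal.find(mimeType);
        if (in != internal.end())
            return "internal:" + in->second;
        return "";  // no parser available for this type
    }
};
```

With only an internal PDF parser registered, choose("application/pdf")
returns the internal one. Once an external command is registered for the
same type, it is returned instead, and other types are unaffected.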

> Can you let me know when most of your PDFs are working with pdftotext so that I
> can start to index PDFs with pdftotext? Is parse_doc.pl ready for use with
> pdftotext? Which patches should be applied to which versions of xpdf for the best
> results?

I've been using pdftotext as part of an external parser for about a
week now, with very satisfactory results. I thought I had stated
clearly last week that I had given up on Sylvain's code, and am now
using pdftotext in my parse_doc.pl script. You can pick up the latest
version of parse_doc.pl from

        http://www.scrc.umanitoba.ca/htdig/rpms/parse_doc.pl

As for which patches and which versions of xpdf, I've only ever used xpdf
0.80, and the patches for it are quoted below. They were both in the
message I cc'ed to you yesterday. The first one is Derek's final correction
for the delta-x problem that caused concatenation of words in my PDFs.
It must be applied to the original xpdf/TextOutputDev.cc file.
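
The core of that correction, as Derek describes it in the quoted message
below, is to break a string wherever the gap between two characters
exceeds 10% of the line height. The following is only a simplified
sketch of that test in C++, not the xpdf code itself: PlacedChar and
splitOnGaps are hypothetical names, and the sketch assumes each
character's right edge already excludes the PDF character spacing
(which the real patch subtracts from dx and dy first).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One character with its left and right edges in device space.
// Hypothetical type for illustration; xpdf keeps this in TextString.
struct PlacedChar {
    char c;
    double xLeft;
    double xRight;
};

// Split a run of characters into separate strings wherever the gap to
// the previous character is more than 10% of the line height, the same
// threshold the patched addChar() uses.
std::vector<std::string> splitOnGaps(const std::vector<PlacedChar> &chars,
                                     double lineHeight) {
    std::vector<std::string> pieces;
    for (std::size_t i = 0; i < chars.size(); ++i) {
        bool gap = i > 0 &&
                   chars[i].xLeft - chars[i - 1].xRight > 0.1 * lineHeight;
        if (pieces.empty() || gap)
            pieces.push_back(std::string());
        pieces.back() += chars[i].c;
    }
    return pieces;
}
```

Given the characters of "forms" followed by an "a" that starts well to
the right of the "s", this yields two pieces, "forms" and "a", instead
of the concatenated "formsa".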

The second one, which is totally optional (and still experimental), adds
a -rawdump option to pdftotext, to disable the sorting and coalescing of
strings. Instead, the text strings are dumped in the order in which they
appear in the PDF, which keeps the columns of a multi-column PDF apart
rather than merging them line by line. I plan to use -rawdump as an
option in the $parsecmd string in parse_doc.pl for PDFs on
www.muug.mb.ca, which are multi-column PageMaker newsletters, but not on
www.scrc.umanitoba.ca, because the Corel DRAW files on my system work
better with the sorting and coalescing. The easiest way to decide whether
this option suits your files is to try it out. If none of your PDFs have
multi-column layouts, you won't need this option, so you won't need to
apply my patch for it.
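
To show why the two modes differ on a multi-column page, here is a small
sketch in C++. It only models the ordering; it is not how xpdf
implements either mode. PageString, rawOrder and sortedOrder are
hypothetical names, and the sketch assumes the PDF stores each column's
strings top to bottom, one column after the other.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// A text string and the position of its top-left corner on the page
// (y grows downward). Hypothetical type, not xpdf's TextString.
struct PageString {
    double x;
    double y;
    std::string text;
};

// Join strings in the order given, which models dumping them in the
// order they appear in the PDF.
std::string rawOrder(const std::vector<PageString> &strs) {
    std::string out;
    for (const PageString &s : strs) {
        if (!out.empty()) out += ' ';
        out += s.text;
    }
    return out;
}

// Sort top to bottom, then left to right, before joining. This models
// reading straight across the page, through every column on each line.
std::string sortedOrder(std::vector<PageString> strs) {
    std::stable_sort(strs.begin(), strs.end(),
                     [](const PageString &a, const PageString &b) {
                         if (a.y != b.y) return a.y < b.y;
                         return a.x < b.x;
                     });
    return rawOrder(strs);
}
```

For a page whose left column reads "one", "two" and whose right column
reads "three", "four", the raw order keeps each column together, while
the sorted order interleaves the columns line by line, which is what
produces excerpts with text from all the columns stuck together.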

I haven't heard back from Derek yet about this patch, but as I said,
it works for me.

You can also get my xpdf patches from

        http://www.scrc.umanitoba.ca/htdig/rpms/xpdf-0.80-deltax.patch
        http://www.scrc.umanitoba.ca/htdig/rpms/xpdf-0.80-rawdump.patch

> Is there anything I can do to help?

As far as I'm concerned, the problem is solved already. All you have to
do is configure your system to use it.

If you want to take the time to integrate pdftotext more tightly into
htdig, you're welcome to work on that. I decided it wasn't worth the
bother, as it works fine for me within an external parser.

I guess writing up some documentation for it would help too.

> Gilles Detillieux wrote:
>
> > Hi again, Derek.
> >
> > According to Derek B. Noonburg:
> > >
> > > [I'm sending this to all three of you because you've all been asking
> > > about the text extraction code, and because you've all been trying this
> > > patch...]
> > >
> > > > Is there a reason you don't also do "dy -= dy2;" ? Just curious.
> > > > I imagine dy2 will always be 0 anyway.
> > >
> > > dy2 will be zero as long as the text is horizontal, and pdftotext won't
> > > work anyway if the text is non-horizontal. But I added it in, just for
> > > the sake of correctness.
> > >
> > > > This change didn't seem to make any difference in the output generated
> > > > from my PDFs. There's still a minor problem that remains. In some of
> > > > my files, pdftotext concatenates the article "a" onto the end of the
> > > > previous word. E.g. if you run it on
> > > >
> > > > http://www.scrc.umanitoba.ca/SCRC/profile/profile_brian_98.pdf
> > > >
> > > > you see words like formsa, Discovereda, abovea & asa. Again, it seems
> > > > to be because of the weird stuff Corel does with char spacing, but your
> > > > fix misses these.
> > >
> > > You're right - it's the char spacing thing again. Here's yet another
> > > version of that function, which also happens to fix Andrew's problem
> > > with the '-' running into 'experience'. This simply looks for
> > > excessive space between characters and breaks the string into pieces,
> > > which are later handled by coalesce().
> > >
> > > void TextPage::addChar(GfxState *state, double x, double y,
> > >                        double dx, double dy, Guchar c) {
> > >   double x1, y1, w1, h1, dx2, dy2;
> > >   int n;
> > >   GBool hexCodes;
> > >
> > >   state->transform(x, y, &x1, &y1);
> > >   state->textTransformDelta(state->getCharSpace(), 0, &dx2, &dy2);
> > >   dx -= dx2;
> > >   dy -= dy2;
> > >   state->transformDelta(dx, dy, &w1, &h1);
> > >   n = curStr->text->getLength();
> > >   if (n > 0 &&
> > >       x1 - curStr->xRight[n-1] > 0.1 * (curStr->yMax - curStr->yMin)) {
> > >     hexCodes = curStr->hexCodes;
> > >     endString();
> > >     beginString(state, NULL, hexCodes);
> > >   }
> > >   curStr->addChar(state, x1, y1, w1, h1, c, useASCII7);
> > > }
> > >
> > > (I apologize for not sending patch files, but I'm working off my
> > > development version, which has other differences from 0.80, and that
> > > makes it hard to get clean patches for specific stuff like this.)
> >
> > No apologies needed. I appreciate your fixes in whatever form you can
> > provide them. For those who want a patch for 0.80, here it is:
> >
> > --- xpdf/TextOutputDev.cc.deltax Fri Nov 27 21:42:16 1998
> > +++ xpdf/TextOutputDev.cc Wed Mar 3 13:55:01 1999
> > @@ -214,10 +214,22 @@ void TextPage::beginString(GfxState *sta
> >
> > void TextPage::addChar(GfxState *state, double x, double y,
> > double dx, double dy, Guchar c) {
> > - double x1, y1, w1, h1;
> > + double x1, y1, w1, h1, dx2, dy2;
> > + int n;
> > + GBool hexCodes;
> >
> > state->transform(x, y, &x1, &y1);
> > + state->textTransformDelta(state->getCharSpace(), 0, &dx2, &dy2);
> > + dx -= dx2;
> > + dy -= dy2;
> > state->transformDelta(dx, dy, &w1, &h1);
> > + n = curStr->text->getLength();
> > + if (n > 0 &&
> > + x1 - curStr->xRight[n-1] > 0.1 * (curStr->yMax - curStr->yMin)) {
> > + hexCodes = curStr->hexCodes;
> > + endString();
> > + beginString(state, NULL, hexCodes);
> > + }
> > curStr->addChar(state, x1, y1, w1, h1, c, useASCII7);
> > }
> >
> >
> > > > Another problem I discovered is when you use pdftotext to index multi-
> > > > column PDFs. The program is almost too clever in how it deals with
> > > > these, producing plain text in a multi-column format. When I run this
> > > > through the indexing code, it produces excerpts that have text of all
> > > > the columns stuck together. What I'd really like is an option to
> > > > "unravel" the columns of text. I imagine the work would have to go
> > > > in TextPage::coalesce() or TextPage::dump(), but I couldn't figure out
> > > > what would need to be done there. Any advice would be appreciated.
> > >
> > > I actually wouldn't call it "clever" -- it's simpler this way. I've
> > > been meaning to rewrite the text extraction code from scratch, but I
> > > haven't had time. I need to come up with a good way to identify
> > > rectangular blocks of text, and do something intelligent with them.
> > > (PDF also provides article threads, which connect columns (I think),
> > > but I don't know how many PDF files actually use them.)
> >
> > Maybe all we'd need is an option to dump out the strings in the order
> > in which they appear in the PDF, instead of sorting and coalescing.
> > You'd lose all your spacing that way, but for indexing, that doesn't
> > matter. When I watch xpdf refresh the screen, it seems to draw the
> > text downward one column at a time, so I'm assuming that's the order
> > in which it appears in the PDF.
> >
> > I took a first stab at it (patch below). It's not pretty, but it seems
> > to do a reasonable job on multi-column files from PageMaker. Of course,
> > it doesn't do as well on my Corel DRAW files, because of the weird order
> > in which it outputs things. Fortunately, I don't have both on the same
> > system, so I can use the -rawdump option on the MUUG system, to index the
> > newsletters, and avoid this option on the SCRC system, to index the
> > Corel DRAW files. Please let me know if what I'm doing is really evil
> > and ugly, or whether it'll pass. As I said, it seems to work for me.
> >
> > --- xpdf/TextOutputDev.cc.noraw Wed Mar 3 13:56:00 1999
> > +++ xpdf/TextOutputDev.cc Wed Mar 3 16:40:39 1999
> > @@ -197,8 +197,9 @@
> > // TextPage
> > //------------------------------------------------------------------------
> >
> > -TextPage::TextPage(GBool useASCII71) {
> > +TextPage::TextPage(GBool useASCII71, GBool rawdump1) {
> > useASCII7 = useASCII71;
> > + rawdump = rawdump1;
> > curStr = NULL;
> > yxStrings = NULL;
> > xyStrings = NULL;
> > @@ -258,6 +259,8 @@
> > y1 = curStr->yMin + 0.5 * h;
> > y2 = curStr->yMin + 0.8 * h;
> > for (p1 = NULL, p2 = yxStrings; p2; p1 = p2, p2 = p2->yxNext) {
> > + if (rawdump)
> > + continue;
> > if (y1 < p2->yMin || (y2 < p2->yMax && curStr->xMax < p2->xMin))
> > break;
> > }
> > @@ -284,6 +287,10 @@
> > #endif
> > str1 = yxStrings;
> > while (str1 && (str2 = str1->yxNext)) {
> > + if (rawdump && (str1->yMin != str2->yMin || str1->yMax != str2->yMax)) {
> > + str1 = str2;
> > + continue;
> > + }
> > space = str1->yMax - str1->yMin;
> > d = str2->xMin - str1->xMax;
> > #if 0 //~tmp
> > @@ -479,6 +486,8 @@
> > for (str1 = yxStrings; str1; str1 = str1->yxNext) {
> >
> > // line this string up with the correct column
> > + if (rawdump && str1->col-col1 > 8)
> > + col1 = col1 == 0 ? str1->col : str1->col-1;
> > for (; col1 < str1->col; ++col1)
> > fputc(' ', f);
> >
> > @@ -493,6 +502,8 @@
> > yMax = str1->yMax;
> >
> > // if we've hit the end of the line...
> > + if (rawdump && str1->yxNext && str1->yxNext->yMax < str1->yMin)
> > + str1->yMin = str1->yMax = 0;
> > #if 0 //~
> > if (!(str1->yxNext && str1->yxNext->yMin < str1->yMax &&
> > str1->yxNext->xMin >= str1->xMax)) {
> > @@ -520,6 +531,8 @@
> >
> > // print the space
> > d = (int)((yMin - yMax) / (str1->yMax - str1->yMin) + 0.5);
> > + if (rawdump && d > 2)
> > + d = 2;
> > for (; d > 0; --d)
> > fputc('\n', f);
> > }
> > @@ -550,7 +563,7 @@
> > // TextOutputDev
> > //------------------------------------------------------------------------
> >
> > -TextOutputDev::TextOutputDev(char *fileName, GBool useASCII7) {
> > +TextOutputDev::TextOutputDev(char *fileName, GBool useASCII7, GBool rawdump) {
> > text = NULL;
> > ok = gTrue;
> >
> > @@ -571,7 +584,7 @@
> > }
> >
> > // set up text object
> > - text = new TextPage(useASCII7);
> > + text = new TextPage(useASCII7, rawdump);
> > }
> >
> > TextOutputDev::~TextOutputDev() {
> > --- xpdf/TextOutputDev.h.noraw Fri Nov 27 21:42:17 1998
> > +++ xpdf/TextOutputDev.h Wed Mar 3 16:12:40 1999
> > @@ -61,7 +61,7 @@
> > public:
> >
> > // Constructor.
> > - TextPage(GBool useASCII71);
> > + TextPage(GBool useASCII71, GBool rawdump1 = gFalse);
> >
> > // Destructor.
> > ~TextPage();
> > @@ -101,6 +101,7 @@
> > private:
> >
> > GBool useASCII7; // use 7-bit ASCII?
> > + GBool rawdump; // dump raw PDF text strings?
> >
> > TextString *curStr; // currently active string
> >
> > @@ -118,8 +119,9 @@
> > // Open a text output file. If <fileName> is NULL, no file is written
> > // (this is useful, e.g., for searching text). If <useASCII7> is true,
> > // text is converted to 7-bit ASCII; otherwise, text is converted to
> > - // 8-bit ISO Latin-1.
> > - TextOutputDev(char *fileName, GBool useASCII7);
> > + // 8-bit ISO Latin-1. If <rawdump> is true, PDF text strings are dumped
> > + // in the order they're found, without sorting and coalescing.
> > + TextOutputDev(char *fileName, GBool useASCII7, GBool rawdump = gFalse);
> >
> > // Destructor.
> > virtual ~TextOutputDev();
> > --- xpdf/pdftotext.cc.noraw Fri Nov 27 21:42:16 1998
> > +++ xpdf/pdftotext.cc Wed Mar 3 15:40:55 1999
> > @@ -29,6 +29,7 @@
> > static int firstPage = 1;
> > static int lastPage = 0;
> > static GBool useASCII7 = gFalse;
> > +static GBool rawdump = gFalse;
> > GBool printCommands = gFalse;
> > static GBool printHelp = gFalse;
> >
> > @@ -39,6 +40,8 @@
> > "last page to convert"},
> > {"-ascii7", argFlag, &useASCII7, 0,
> > "convert to 7-bit ASCII (default is 8-bit ISO Latin-1)"},
> > + {"-rawdump",argFlag, &rawdump, 0,
> > + "dump out raw strings from PDF (default is to sort and coalesce)"},
> > {"-h", argFlag, &printHelp, 0,
> > "print usage information"},
> > {"-help", argFlag, &printHelp, 0,
> > @@ -96,7 +99,7 @@
> > lastPage = doc->getNumPages();
> >
> > // write text file
> > - textOut = new TextOutputDev(textFileName->getCString(), useASCII7);
> > + textOut = new TextOutputDev(textFileName->getCString(), useASCII7, rawdump);
> > if (textOut->isOk())
> > doc->displayPages(textOut, firstPage, lastPage, 72, 0);
> > delete textOut;
> >
> > --
> > Gilles R. Detillieux E-mail: <grdetil@scrc.umanitoba.ca>
> > Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
> > Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> > Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
> > ------------------------------------
> > To unsubscribe from the htdig mailing list, send a message to
> > htdig@htdig.org containing the single word "unsubscribe" in
> > the SUBJECT of the message.
>

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930