Re: [htdig] Using pdftotext to index PDF documents


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 3 Mar 1999 16:54:04 -0600 (CST)


Hi again, Derek.

According to Derek B. Noonburg:
>
> [I'm sending this to all three of you because you've all been asking
> about the text extraction code, and because you've all been trying this
> patch...]
>
> > Is there a reason you don't also do "dy -= dy2;" ? Just curious.
> > I imagine dy2 will always be 0 anyway.
>
> dy2 will be zero as long as the text is horizontal, and pdftotext won't
> work anyway if the text is non-horizontal. But I added it in, just for
> the sake of correctness.
>
> > This change didn't seem to make any difference in the output generated
> > from my PDFs. There's still a minor problem that remains. In some of
> > my files, pdftotext concatenates the article "a" onto the end of the
> > previous word. E.g. if you run it on
> >
> > http://www.scrc.umanitoba.ca/SCRC/profile/profile_brian_98.pdf
> >
> > you see words like formsa, Discovereda, abovea & asa. Again, it seems
> > to be because of the weird stuff Corel does with char spacing, but your
> > fix misses these.
>
> You're right - it's the char spacing thing again. Here's yet another
> version of that function, which also happens to fix Andrew's problem
> with the '-' running into 'experience'. This simply looks for
> excessive space between characters and breaks the string into pieces,
> which are later handled by coalesce().
>
> void TextPage::addChar(GfxState *state, double x, double y,
>                        double dx, double dy, Guchar c) {
>   double x1, y1, w1, h1, dx2, dy2;
>   int n;
>   GBool hexCodes;
>
>   state->transform(x, y, &x1, &y1);
>   state->textTransformDelta(state->getCharSpace(), 0, &dx2, &dy2);
>   dx -= dx2;
>   dy -= dy2;
>   state->transformDelta(dx, dy, &w1, &h1);
>   n = curStr->text->getLength();
>   if (n > 0 &&
>       x1 - curStr->xRight[n-1] > 0.1 * (curStr->yMax - curStr->yMin)) {
>     hexCodes = curStr->hexCodes;
>     endString();
>     beginString(state, NULL, hexCodes);
>   }
>   curStr->addChar(state, x1, y1, w1, h1, c, useASCII7);
> }
>
> (I apologize for not sending patch files, but I'm working off my
> development version, which has other differences from 0.80, and that
> makes it hard to get clean patches for specific stuff like this.)

No apologies needed. I appreciate your fixes in whatever form you can
provide them. For those who want a patch for 0.80, here it is:

--- xpdf/TextOutputDev.cc.deltax Fri Nov 27 21:42:16 1998
+++ xpdf/TextOutputDev.cc Wed Mar 3 13:55:01 1999
@@ -214,10 +214,22 @@ void TextPage::beginString(GfxState *sta
 
 void TextPage::addChar(GfxState *state, double x, double y,
                        double dx, double dy, Guchar c) {
-  double x1, y1, w1, h1;
+  double x1, y1, w1, h1, dx2, dy2;
+  int n;
+  GBool hexCodes;
 
   state->transform(x, y, &x1, &y1);
+  state->textTransformDelta(state->getCharSpace(), 0, &dx2, &dy2);
+  dx -= dx2;
+  dy -= dy2;
   state->transformDelta(dx, dy, &w1, &h1);
+  n = curStr->text->getLength();
+  if (n > 0 &&
+      x1 - curStr->xRight[n-1] > 0.1 * (curStr->yMax - curStr->yMin)) {
+    hexCodes = curStr->hexCodes;
+    endString();
+    beginString(state, NULL, hexCodes);
+  }
   curStr->addChar(state, x1, y1, w1, h1, c, useASCII7);
 }
 

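For anyone reading the patch cold, the heart of it is the gap test: start
a new string whenever the next character begins more than a tenth of the
line height to the right of the previous one. Here is a minimal standalone
sketch of just that test. The Glyph struct and splitOnGaps() are made up
for illustration and are not part of xpdf; only the 0.1 factor comes from
the code above.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one positioned character (not an xpdf type).
struct Glyph {
  char c;
  double xLeft;   // left edge, device space
  double xRight;  // right edge, device space
};

// Break a run of glyphs on one baseline into pieces wherever the gap to
// the previous glyph exceeds 0.1 * lineHeight, mirroring the patch's test.
std::vector<std::string> splitOnGaps(const std::vector<Glyph> &glyphs,
                                     double lineHeight) {
  std::vector<std::string> pieces;
  std::string cur;
  double prevRight = 0.0;
  for (const Glyph &g : glyphs) {
    if (!cur.empty() && g.xLeft - prevRight > 0.1 * lineHeight) {
      pieces.push_back(cur);  // excessive gap: close the current piece
      cur.clear();
    }
    cur.push_back(g.c);
    prevRight = g.xRight;
  }
  if (!cur.empty())
    pieces.push_back(cur);
  return pieces;
}
```

Given the glyphs of "as" followed by an "a" that starts two units further
right on a 10-unit-high line, this returns the pieces "as" and "a" rather
than a single "asa". In pdftotext itself the pieces are separate strings
that coalesce() deals with later.
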
> > Another problem I discovered is when you use pdftotext to index multi-
> > column PDFs. The program is almost too clever in how it deals with
> > these, producing plain text in a multi-column format. When I run this
> > through the indexing code, it produces excerpts that have text of all
> > the columns stuck together. What I'd really like is an option to
> > "unravel" the columns of text. I imagine the work would have to go
> > in TextPage::coalesce() or TextPage::dump(), but I couldn't figure out
> > what would need to be done there. Any advice would be appreciated.
>
> I actually wouldn't call it "clever" -- it's simpler this way. I've
> been meaning to rewrite the text extraction code from scratch, but I
> haven't had time. I need to come up with a good way to identify
> rectangular blocks of text, and do something intelligent with them.
> (PDF also provides article threads, which connect columns (I think),
> but I don't know how many PDF files actually use them.)

Maybe all we'd need is an option to dump out the strings in the order
in which they appear in the PDF, instead of sorting and coalescing.
You'd lose all your spacing that way, but for indexing, that doesn't
matter. When I watch xpdf refresh the screen, it seems to draw the
text downward one column at a time, so I'm assuming that's the order
in which it appears in the PDF.
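
To make the difference concrete, here is a small sketch comparing the two
orderings on a two-column page. The Chunk struct and both functions are
hypothetical, not xpdf code, and the sketch assumes that the PDF stores its
strings one column at a time and that y grows down the page.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical positioned text string (not an xpdf type).
struct Chunk {
  std::string text;
  double x;  // left edge
  double y;  // top edge; assumed to grow down the page
};

// Join chunk texts with single spaces, in the order given.
std::string joinChunks(const std::vector<Chunk> &chunks) {
  std::string out;
  for (const Chunk &c : chunks) {
    if (!out.empty())
      out += ' ';
    out += c.text;
  }
  return out;
}

// "Raw" order: keep the strings exactly as they came out of the file.
std::string rawOrder(const std::vector<Chunk> &chunks) {
  return joinChunks(chunks);
}

// Layout order: sort top-to-bottom, then left-to-right, which interleaves
// the lines of side-by-side columns.
std::string layoutOrder(std::vector<Chunk> chunks) {
  std::stable_sort(chunks.begin(), chunks.end(),
                   [](const Chunk &a, const Chunk &b) {
                     if (a.y != b.y)
                       return a.y < b.y;
                     return a.x < b.x;
                   });
  return joinChunks(chunks);
}
```

With column A holding "A1", "A2" and column B holding "B1", "B2", the raw
order gives "A1 A2 B1 B2", which is what an indexer wants for excerpts,
while the layout order gives "A1 B1 A2 B2".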

I took a first stab at it (patch below). It's not pretty, but it seems
to do a reasonable job on multi-column files from PageMaker. Of course,
it doesn't do as well on my Corel DRAW files, because of the weird order
in which it outputs things. Fortunately, I don't have both on the same
system, so I can use the -rawdump option on the MUUG system, to index the
newsletters, and avoid this option on the SCRC system, to index the
Corel DRAW files. Please let me know if what I'm doing is really evil
and ugly, or whether it'll pass. As I said, it seems to work for me.

--- xpdf/TextOutputDev.cc.noraw Wed Mar 3 13:56:00 1999
+++ xpdf/TextOutputDev.cc Wed Mar 3 16:40:39 1999
@@ -197,8 +197,9 @@
 // TextPage
 //------------------------------------------------------------------------
 
-TextPage::TextPage(GBool useASCII71) {
+TextPage::TextPage(GBool useASCII71, GBool rawdump1) {
   useASCII7 = useASCII71;
+  rawdump = rawdump1;
   curStr = NULL;
   yxStrings = NULL;
   xyStrings = NULL;
@@ -258,6 +259,8 @@
   y1 = curStr->yMin + 0.5 * h;
   y2 = curStr->yMin + 0.8 * h;
   for (p1 = NULL, p2 = yxStrings; p2; p1 = p2, p2 = p2->yxNext) {
+    if (rawdump)
+      continue;
     if (y1 < p2->yMin || (y2 < p2->yMax && curStr->xMax < p2->xMin))
       break;
   }
@@ -284,6 +287,10 @@
 #endif
   str1 = yxStrings;
   while (str1 && (str2 = str1->yxNext)) {
+    if (rawdump && (str1->yMin != str2->yMin || str1->yMax != str2->yMax)) {
+      str1 = str2;
+      continue;
+    }
     space = str1->yMax - str1->yMin;
     d = str2->xMin - str1->xMax;
 #if 0 //~tmp
@@ -479,6 +486,8 @@
   for (str1 = yxStrings; str1; str1 = str1->yxNext) {
 
     // line this string up with the correct column
+    if (rawdump && str1->col-col1 > 8)
+      col1 = col1 == 0 ? str1->col : str1->col-1;
     for (; col1 < str1->col; ++col1)
       fputc(' ', f);
 
@@ -493,6 +502,8 @@
       yMax = str1->yMax;
 
     // if we've hit the end of the line...
+    if (rawdump && str1->yxNext && str1->yxNext->yMax < str1->yMin)
+      str1->yMin = str1->yMax = 0;
 #if 0 //~
     if (!(str1->yxNext && str1->yxNext->yMin < str1->yMax &&
           str1->yxNext->xMin >= str1->xMax)) {
@@ -520,6 +531,8 @@
           
         // print the space
         d = (int)((yMin - yMax) / (str1->yMax - str1->yMin) + 0.5);
+        if (rawdump && d > 2)
+          d = 2;
         for (; d > 0; --d)
           fputc('\n', f);
       }
@@ -550,7 +563,7 @@
 // TextOutputDev
 //------------------------------------------------------------------------
 
-TextOutputDev::TextOutputDev(char *fileName, GBool useASCII7) {
+TextOutputDev::TextOutputDev(char *fileName, GBool useASCII7, GBool rawdump) {
   text = NULL;
   ok = gTrue;
 
@@ -571,7 +584,7 @@
   }
 
   // set up text object
-  text = new TextPage(useASCII7);
+  text = new TextPage(useASCII7, rawdump);
 }
 
 TextOutputDev::~TextOutputDev() {
--- xpdf/TextOutputDev.h.noraw Fri Nov 27 21:42:17 1998
+++ xpdf/TextOutputDev.h Wed Mar 3 16:12:40 1999
@@ -61,7 +61,7 @@
 public:
 
   // Constructor.
-  TextPage(GBool useASCII71);
+  TextPage(GBool useASCII71, GBool rawdump1 = gFalse);
 
   // Destructor.
   ~TextPage();
@@ -101,6 +101,7 @@
 private:
 
   GBool useASCII7; // use 7-bit ASCII?
+  GBool rawdump; // dump raw PDF text strings?
 
   TextString *curStr; // currently active string
 
@@ -118,8 +119,9 @@
   // Open a text output file. If <fileName> is NULL, no file is written
   // (this is useful, e.g., for searching text). If <useASCII7> is true,
   // text is converted to 7-bit ASCII; otherwise, text is converted to
-  // 8-bit ISO Latin-1.
-  TextOutputDev(char *fileName, GBool useASCII7);
+  // 8-bit ISO Latin-1. If <rawdump> is true, PDF text strings are dumped
+  // in the order they're found, without sorting and coalescing.
+  TextOutputDev(char *fileName, GBool useASCII7, GBool rawdump = gFalse);
 
   // Destructor.
   virtual ~TextOutputDev();
--- xpdf/pdftotext.cc.noraw Fri Nov 27 21:42:16 1998
+++ xpdf/pdftotext.cc Wed Mar 3 15:40:55 1999
@@ -29,6 +29,7 @@
 static int firstPage = 1;
 static int lastPage = 0;
 static GBool useASCII7 = gFalse;
+static GBool rawdump = gFalse;
 GBool printCommands = gFalse;
 static GBool printHelp = gFalse;
 
@@ -39,6 +40,8 @@
    "last page to convert"},
   {"-ascii7", argFlag, &useASCII7, 0,
    "convert to 7-bit ASCII (default is 8-bit ISO Latin-1)"},
+  {"-rawdump",argFlag, &rawdump, 0,
+   "dump out raw strings from PDF (default is to sort and coalesce)"},
   {"-h", argFlag, &printHelp, 0,
    "print usage information"},
   {"-help", argFlag, &printHelp, 0,
@@ -96,7 +99,7 @@
     lastPage = doc->getNumPages();
 
   // write text file
-  textOut = new TextOutputDev(textFileName->getCString(), useASCII7);
+  textOut = new TextOutputDev(textFileName->getCString(), useASCII7, rawdump);
   if (textOut->isOk())
     doc->displayPages(textOut, firstPage, lastPage, 72, 0);
   delete textOut;

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930