Re: [htdig] Using pdftotext to index PDF documents


Patrick Dugal (patrick.dugal@nrc.ca)
Mon, 01 Mar 1999 13:37:53 -0500


Gilles Detillieux wrote:

> There's still a bit more work to be done. Patrick mentioned that
> pdftotext changed hyphens to spaces.

I don't think I ever said that. I mentioned that Ghostscript's ps2ascii, which accepts
PDF as input only when it feels like it, translates hyphens (-) into spaces. Xpdf's
pdftotext leaves the hyphens in, just as they are in the PDF. This may hurt search
results, but at least it's consistent.
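
To make the hyphen issue concrete, here is a minimal sketch of how the two
behaviours change the words an indexer ends up with. The splitWords helper and
its keepHyphen flag are made up for illustration; this is not htdig's or
xpdf's actual word-splitting code.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not part of htdig or xpdf): split text into
// lowercase words. When keepHyphen is true, '-' counts as part of a word.
std::vector<std::string> splitWords(const std::string &text, bool keepHyphen) {
    std::vector<std::string> words;
    std::string cur;
    for (unsigned char ch : text) {
        if (std::isalnum(ch) || (keepHyphen && ch == '-')) {
            cur += static_cast<char>(std::tolower(ch));
        } else if (!cur.empty()) {
            // Any other character ends the current word.
            words.push_back(cur);
            cur.clear();
        }
    }
    if (!cur.empty()) words.push_back(cur);
    return words;
}
```

With the hyphen left in, as pdftotext does, splitWords("built-in", true) gives
the single word "built-in". If the converter has already turned the hyphen
into a space, the same call on "built in" gives two words, "built" and "in",
so a search for "built-in" can only match if the query is split the same way.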

> (Which raises the question: "why can't an external
> parser just pass plain text or HTML to htdig for further parsing?")

Very good question. Intuitively, I thought this was the way it should work. That
way, it would be easier to configure, without having to get into any programming
changes.

> Some users may also want to extract the titles from their PDFs, as
> Sylvain's code did.

The "title" field stored in a PDF is not as meaningful as one would like. As far as
I know, there is no consistent way to extract the real title of a document. How does
Adobe expect people to be able to index large numbers of PDFs?

> Anyway, here's Derek's fix for my concatenation problem:
>
> --- xpdf/TextOutputDev.cc.deltax Fri Nov 27 21:42:16 1998
> +++ xpdf/TextOutputDev.cc Thu Feb 25 09:55:28 1999
> @@ -217,6 +217,7 @@ void TextPage::addChar(GfxState *state,
> double x1, y1, w1, h1;
>
> state->transform(x, y, &x1, &y1);
> + dx -= state->getCharSpace();
> state->transformDelta(dx, dy, &w1, &h1);
> curStr->addChar(state, x1, y1, w1, h1, c, useASCII7);
> }
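
Here is a rough sketch of why taking the character spacing out of dx could
fix the concatenation. It assumes a simplified model in which a space is
emitted whenever the gap between the end of one glyph box and the start of
the next exceeds a threshold. The Glyph struct, joinGlyphs and the threshold
are all hypothetical; xpdf's real logic in TextOutputDev.cc is more involved.

```cpp
#include <string>
#include <vector>

// Hypothetical glyph record: x is the start position, dx is the advance
// (which includes charSpace), charSpace is the extra spacing in effect.
struct Glyph {
    char c;
    double x;
    double dx;
    double charSpace;
};

// Join glyphs into text, inserting a space when the gap between the end of
// the previous glyph box and the start of the next exceeds gapThreshold.
// subtractCharSpace mimics the effect of "dx -= state->getCharSpace()".
std::string joinGlyphs(const std::vector<Glyph> &glyphs,
                       bool subtractCharSpace, double gapThreshold) {
    std::string out;
    double prevEnd = 0.0;
    bool first = true;
    for (const Glyph &g : glyphs) {
        // Box width: either the raw advance or the advance minus spacing.
        double width = subtractCharSpace ? g.dx - g.charSpace : g.dx;
        if (!first && g.x - prevEnd > gapThreshold) {
            out += ' ';
        }
        out += g.c;
        prevEnd = g.x + width;
        first = false;
    }
    return out;
}
```

Take four glyphs h, i, y, o at x = 0, 5, 14, 19, each 5 units wide, where the
"i" carries 4 units of character spacing (so its dx is 9). Without the
subtraction the box for "i" stretches right up to the "y", no gap is seen, and
the result is "hiyo". With the subtraction the 4-unit gap shows up and the
result is "hi yo".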

This patch worked! I tested it on the "profile_rob_98.txt" and the output was
much better. Kudos to Derek!
Thanks to Gilles for getting in touch.

Pat :)




This archive was generated by hypermail 2.0b3 on Thu Mar 04 1999 - 09:09:18 PST