Re: [htdig] pdf parser: No error;) Search: No results;(


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 24 Feb 1999 12:19:50 -0600 (CST)


According to Rick Wiggins:
> Why can't we have it index a text version of the PDF? xpdf has a pdftotext
> utility that could be used. What is the advantage of converting to
> PostScript?

Well, for one thing, the PostScript includes information about character
spacing. As I mentioned earlier, some PDF files use large character
spacing as word spacing. I had modified PDF.cc to recognise and deal
with this in some PDF files I had. Without this patch, the words from
these documents were all concatenated together. I tried pdftotext on
these same documents, and the text it spat out also had the words
concatenated together, so it wouldn't be suitable for these documents.
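
To illustrate the character-spacing problem, here is a minimal sketch in
Perl of the general idea. It is not the actual logic in PDF.cc: the glyph
records, the join_glyphs name and the 0.3 threshold are all assumptions
made up for this example. Given each character's horizontal position and
width, a gap noticeably wider than the previous character is treated as a
word break, even though no space character is present.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild text from positioned glyphs.
# Each glyph is [char, x position, width], all in the same units.
# $ratio is an assumed threshold: a gap wider than $ratio times the
# previous glyph's width is treated as a word break.
sub join_glyphs {
    my ($glyphs, $ratio) = @_;
    $ratio = 0.3 unless defined $ratio;
    my $text = '';
    my $prev;
    for my $g (@$glyphs) {
        my ($ch, $x, $w) = @$g;
        if (defined $prev) {
            # distance from the end of the previous glyph to this one
            my $gap = $x - ($prev->[1] + $prev->[2]);
            $text .= ' ' if $gap > $ratio * $prev->[2];
        }
        $text .= $ch;
        $prev = $g;
    }
    return $text;
}

# "hi" and "yo" separated only by a wide gap, with no space character
my @glyphs = (['h', 0, 5], ['i', 5, 5], ['y', 15, 5], ['o', 20, 5]);
print join_glyphs(\@glyphs), "\n";    # prints "hi yo"
```

A converter that discards the position information has nothing to apply
such a test to, which is why the words come out run together.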

Patrick Dugal just reported that pdftotext seemed to work OK with his
documents, whereas my patches to htdig/PDF.cc in 3.1.1 didn't seem to solve
his concatenation problem, so I'm assuming his problem is somewhat
different from the one I discovered. I'll follow up with him to get
to the bottom of this, but right now it seems neither approach is totally
reliable. I'm hoping I can change this.

For those out there who'd really like to use pdftotext to index their
PDF documents, here's a patch to contrib/parse_doc.pl (from the ht://Dig
3.1.1 source) that should make it handle PDFs:

--- contrib/parse_doc.pl.nopdf Tue Feb 16 23:03:39 1999
+++ contrib/parse_doc.pl Wed Feb 24 12:08:09 1999
@@ -34,6 +34,11 @@
 # get it from the ghostscript 3.33 (or later) package
 #
 $CATPS = "/usr/bin/ps2ascii";
+#
+# set this to your PDF to text converter
+# get it from the xpdf 0.80 package at http://www.foolabs.com/xpdf/
+#
+$CATPDF = "/usr/bin/pdftotext";
 
 # need some var's
 @allwords = ();
@@ -57,6 +62,10 @@
         $parser = $CATPS; # gs 3.33 leaves _temp_.??? files in .
         $parsecmd = "(cd /tmp; $parser; rm -f _temp_.???) < $ARGV[0] |";
         $type = "PostScript";
+} elsif ($magic =~ /%PDF-/) { # it's PDF (Acrobat)
+        $parser = $CATPDF;
+        $parsecmd = "$parser $ARGV[0] - |";
+        $type = "PDF";
 } elsif ($magic =~ /WPC/) { # it's WordPerfect
         $parser = $CATWP;
         $parsecmd = "$parser $ARGV[0] |";
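
The detection step added by the patch can also be tried on its own. The
following is a minimal sketch, not part of parse_doc.pl: pick_converter is
a hypothetical name, and it knows only the %PDF- signature and the
pdftotext path taken from the patch above. It uses the list form of a
piped open (Perl 5.8 or later), which runs the converter without a shell,
so unusual characters in the file name need no quoting.

```perl
use strict;
use warnings;

# Path copied from the patch; adjust for your system.
my $CATPDF = "/usr/bin/pdftotext";

# Hypothetical helper: given the first bytes of a file, return the
# converter command as a list, or an empty list if unrecognised.
sub pick_converter {
    my ($magic, $file) = @_;
    if ($magic =~ /%PDF-/) {
        # "-" asks pdftotext to write the text to standard output
        return ($CATPDF, $file, "-");
    }
    return ();
}

if (@ARGV) {
    my $file = $ARGV[0];
    open(my $in, '<', $file) or die "Can't open $file: $!\n";
    binmode($in);
    read($in, my $magic, 8);
    close($in);
    $magic = '' unless defined $magic;

    my @cmd = pick_converter($magic, $file)
        or die "No converter known for $file\n";
    # List-form piped open runs the command directly, without a shell
    open(my $text, '-|', @cmd) or die "Can't run $cmd[0]: $!\n";
    print while <$text>;
    close($text);
}
```

Run against a PDF file, this should print the same text that the patched
script would go on to split into words.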

If you have ghostscript 3.33 or later installed, with its ps2ascii
utility, then parse_doc.pl will index your PostScript documents for you too.
Of course, you'll need to customise the paths to any "whatever"-to-text
converters you have, at the start of the script, and install the script in
an appropriate location. Then, you can add something like this to your
htdig.conf:

external_parsers: application/msword /usr/local/bin/parse_doc.pl \
                application/postscript /usr/local/bin/parse_doc.pl \
                application/pdf /usr/local/bin/parse_doc.pl

The last pair above will override htdig's internal PDF handling.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


