Re: [htdig] Using pdftotext to index PDF documents


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Thu, 25 Feb 1999 15:41:18 -0600 (CST)


> There's still a bit more work to be done. Patrick mentioned that
> pdftotext changed hyphens to spaces. Not so, but parse_doc.pl does.
> In fact, it converts all punctuation to spaces, to separate out the words.
> The problem is right now, the word list is what it spits out for the
> "h" record as well. So there's no punctuation at all in the excerpts!

OK, here's take 2 on my parse_doc.pl patch to support pdftotext.
Apart from some cleanup, and the same additions as my earlier (and
now obsolete) patch, it builds a separate string for the head ("h")
record, processed the same way htdig processes plain text files, so
the excerpts keep their punctuation.
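
For anyone who wants to see the head-record handling in isolation,
here is a minimal standalone sketch of the same cleanup steps the patch
applies to $head. The make_head helper and the sample text are made up
for illustration; they are not part of parse_doc.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in parse_doc.pl): normalize whitespace and
# escape &, < and >, the same steps the patch applies to $head.
sub make_head {
    my ($text) = @_;
    $text =~ s/^\s+//;      # drop leading whitespace
    $text =~ s/\s+$//;      # drop trailing whitespace
    $text =~ s/\s+/ /g;     # collapse whitespace runs, including newlines
    $text =~ s/&/&amp;/g;   # escape & first, so the entities below survive
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Made-up sample text; punctuation and hyphens are left alone.
my $sample = "  Well-known  results:\n a < b & c > d.  \n";
print "h\t", make_head($sample), "\n";
```

The word list is still built from the punctuation-stripped lines; only
the excerpt text goes through this path.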

It seems to work like a charm on my PDFs (with the patch to pdftotext
I posted earlier). I'd like a few other PDF users to try it out as an
external parser for application/pdf documents on their systems. Also,
if anyone with more Perl experience than me (going on a few hours now)
can critique the code, either my changes or the original code, I'd
appreciate the edification.

You can pick up the latest script from

        http://www.scrc.umanitoba.ca/htdig/rpms/parse_doc.pl

or apply the patch below. This patch should be applied to the original
contrib/parse_doc.pl shipped with htdig-3.1.1.tar.gz:

--- contrib/parse_doc.pl.nopdf Tue Feb 16 23:03:39 1999
+++ contrib/parse_doc.pl Thu Feb 25 15:16:43 1999
@@ -10,9 +10,15 @@
 # Changed: push line semi-colomn wrong. <carl@dpiwe.tas.gov.au>
 # Changed: matching works for end of lines now <carl@dpiwe.tas.gov.au>
 # Added: option to rigorously delete all punctuation <carl@dpiwe.tas.gov.au>
+#
+# 1999/02/09
 # Added: option to delete all hyphens <grdetil@scrc.umanitoba.ca>
-# Changed: uses ps2ascii to handle PS files <grdetil@scrc.umanitoba.ca>
+# Added: uses ps2ascii to handle PS files <grdetil@scrc.umanitoba.ca>
+# 1999/02/15
 # Added: check for some file formats <Frank.Richter@hrz.tu-chemnitz.de>
+# 1999/02/25
+# Added: uses pdftotext to handle PDF files <grdetil@scrc.umanitoba.ca>
+# Changed: generates a head record with punct. <grdetil@scrc.umanitoba.ca>
 #########################################
 #
 # set this to your MS Word to text converter
@@ -34,8 +40,14 @@
 # get it from the ghostscript 3.33 (or later) package
 #
 $CATPS = "/usr/bin/ps2ascii";
+#
+# set this to your PDF to text converter
+# get it from the xpdf 0.80 package at http://www.foolabs.com/xpdf/
+#
+$CATPDF = "/usr/bin/pdftotext";
 
 # need some var's
+$head = "";
 @allwords = ();
 @temp = ();
 $x = 0;
@@ -57,6 +69,10 @@
         $parser = $CATPS; # gs 3.33 leaves _temp_.??? files in .
         $parsecmd = "(cd /tmp; $parser; rm -f _temp_.???) < $ARGV[0] |";
         $type = "PostScript";
+} elsif ($magic =~ /%PDF-/) { # it's PDF (Acrobat)
+ $parser = $CATPDF;
+ $parsecmd = "$parser $ARGV[0] - |";
+ $type = "PDF";
 } elsif ($magic =~ /WPC/) { # it's WordPerfect
         $parser = $CATWP;
         $parsecmd = "$parser $ARGV[0] |";
@@ -77,6 +93,7 @@
 # open it
 open(CAT, "$parsecmd") || die "Hmmm. $parser doesn't want to be opened using pipe.\n";
 while (<CAT>) {
+ $head .= " " . $_;
         s/\s[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]\s|^[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]$/ /g; # replace reading-chars with space (only at end or begin of word)
 # s/[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]/ /g; # rigorously replace all by <carl@dpiwe.tas.gov.au>
         s/-/ /g; # replace hyphens with space
@@ -101,15 +118,22 @@
 
 #############################################
 # print out the head
-$calc = @allwords;
-print "h\t";
-#if ($calc >100) { # but not more than 100 words
-# $calc = 100;
+$head =~ s/^\s+//g;
+$head =~ s/\s+$//g;
+$head =~ s/\s+/ /g;
+$head =~ s/&/\&amp\;/g;
+$head =~ s/</\&lt\;/g;
+$head =~ s/>/\&gt\;/g;
+print "h\t$head\n";
+#$calc = @allwords;
+#print "h\t";
+##if ($calc >100) { # but not more than 100 words
+## $calc = 100;
+##}
+#for ($x=0; $x<$calc; $x++) { # print out the words for the exerpt
+# print "$allwords[$x] ";
 #}
-for ($x=0; $x<$calc; $x++) { # print out the words for the exerpt
- print "$allwords[$x] ";
-}
-print "\n";
+#print "\n";
 
 
 #############################################
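
The PDF branch above keys off the "%PDF-" signature at the start of
the file. As a rough standalone sketch of that kind of sniffing, the
snippet below classifies a file from its first bytes. The helper names
(detect_type, read_magic), the 8-byte read length, and the /^%!/ test
for PostScript are my assumptions for illustration; only the /%PDF-/
and /WPC/ tests are taken from the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: classify a document from its leading bytes.
sub detect_type {
    my ($magic) = @_;
    return "PostScript"  if $magic =~ /^%!/;    # assumption: usual PS header
    return "PDF"         if $magic =~ /%PDF-/;  # same test as the patch
    return "WordPerfect" if $magic =~ /WPC/;    # same test as the script
    return "unknown";
}

# Hypothetical helper: read the first $len bytes of a file.
sub read_magic {
    my ($path, $len) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!\n";
    binmode($fh);
    my $magic = '';
    read($fh, $magic, $len);
    close($fh);
    return $magic;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    print detect_type(read_magic($ARGV[0], 8)), "\n";
}
```

Note that the patch builds $parsecmd as a shell string, so a path
containing spaces or shell metacharacters would need quoting; this
sketch does not address that.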

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


