Re: [htdig] PDF & ISO-Latin chars


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Fri, 13 Aug 1999 12:40:05 -0500 (CDT)


According to Antti Rauramo:
> Yep, great, using xpdf's pdftotext helped! Now also searching pdf's works
> flawlessly! Thank you!

Glad to help.

> > You may want to adapt the script to extract titles from PDFs using
> > pdfinfo, if the titles matter to you. (That's something on my to-do
> > list I can't seem to find the time for.)
>
> Oop, heh, didn't read up to here before already adding a part to parse_doc.pl which
> reads the pdf and finds the title. (Though this may have problems w/ crypted pdf's)
> Here's the cut beginning around line 152...
>
>
> #############################################
> # print out the title
> #@temp = split(/\//, $ARGV[2]); # get the filename, get rid of basename
> #print "t\t$type Document $temp[-1]\n"; # print it
>
> ### 13-08-1999 ant
> open(TITLEIN,"<$ARGV[0]") || print STDERR "$ARGV[0]: $!\n";
> while(<TITLEIN>){
> if(/title/i){
> ($pdftitle)=/\/Title \(([^\/)]+)[\/\)]/i;
> $pdftitle && close TITLEIN;
> }
> }
> close TITLEIN;
>
> $pdftitle=~s/\\(\d\d\d)/pack(c,oct($1))/ge;
> if(!$pdftitle){ $pdftitle="$type Document $temp[-1]"; }
> print "t\t$pdftitle\n";
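
A side note on the octal-escape line in the quoted snippet: PDF literal strings write non-ASCII bytes (ISO-Latin characters, for instance) as backslash escapes of one to three octal digits. The following is only a minimal sketch of that decoding step in isolation; decode_pdf_octal is a hypothetical helper name, not something in parse_doc.pl, and it assumes the string has already been pulled out of the PDF.

```perl
use strict;
use warnings;

# Hypothetical helper: decode \ddd octal escapes in a PDF literal string
# into single bytes (ISO-Latin-1 code points for values above 127).
sub decode_pdf_octal {
    my ($s) = @_;
    # "C" packs one unsigned byte; the quoted pack(c, ...) form relies on
    # a bareword and on the signed-char template instead.
    # The & 0xFF mask drops any overflow from escapes above \377.
    $s =~ s/\\([0-7]{1,3})/pack("C", oct($1) & 0xFF)/ge;
    return $s;
}

# Example: two escaped e-acute characters in an otherwise ASCII title.
my $decoded = decode_pdf_octal('R\351sum\351');
print join(" ", map { sprintf "%02x", ord } split //, $decoded), "\n";
```

This does not handle the other backslash escapes (such as \n or escaped parentheses), so treat it as an illustration rather than a complete string parser.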

I don't know how well pdftotext and pdfinfo deal with encrypted PDFs either.
I think they need patches for this, and somehow need to be given the
decryption keys.

I do see a problem with your approach, though. The first /Title definition
isn't necessarily the one you want: outline (bookmark) entries carry /Title
keys too, so which one comes first depends on how the dictionaries are laid
out in the PDF. Here are my recent changes to parse_doc.pl, which
I posted to http://www.htdig.org/files/contrib/parsers/ and to the 3.2
source tree:

Index: contrib/parse_doc.pl
===================================================================
RCS file: /opt/htdig/cvs/htdig3/contrib/parse_doc.pl,v
retrieving revision 1.5
retrieving revision 1.6
diff -u -r1.5 -r1.6
--- contrib/parse_doc.pl 1999/03/22 21:39:46 1.5
+++ contrib/parse_doc.pl 1999/08/12 22:11:38 1.6
@@ -27,6 +27,11 @@
 # (in PDFs) & remove multiple punct. chars. between words (all)
 # 1999/03/10
 # Changed: fix handling of minimum word length <grdetil@scrc.umanitoba.ca>
+# 1999/08/12
+# Changed: adapted for xpdf 0.90 release <grdetil@scrc.umanitoba.ca>
+# Added: uses pdfinfo to handle PDF titles <grdetil@scrc.umanitoba.ca>
+# Changed: keep hyphens by default, as htdig <grdetil@scrc.umanitoba.ca>
+# does, but change dashes to hyphens
 #########################################
 #
 # set this to your MS Word to text converter
@@ -49,11 +54,13 @@
 #
 $CATPS = "/usr/bin/ps2ascii";
 #
-# set this to your PDF to text converter
-# get it from the xpdf 0.80 package at http://www.foolabs.com/xpdf/
+# set this to your PDF to text converter, and pdfinfo tool
+# get it from the xpdf 0.90 package at http://www.foolabs.com/xpdf/
 #
 $CATPDF = "/usr/bin/pdftotext";
+$PDFINFO = "/usr/bin/pdfinfo";
 #$CATPDF = "/usr/local/bin/pdftotext";
+#$PDFINFO = "/usr/local/bin/pdfinfo";
 
 # need some var's
 $minimum_word_length = 3;
@@ -64,6 +71,7 @@
 @fields = ();
 $calc = 0;
 $dehyphenate = 0;
+$title = "";
 #
 # okay. my programming style isn't that nice, but it works...
 
@@ -97,11 +105,25 @@
         }
 } elsif ($magic =~ /%PDF-/) { # it's PDF (Acrobat)
         $parser = $CATPDF;
- $parsecmd = "$parser $ARGV[0] - |";
-# kludge to handle multi-column PDFs... (needs patched pdftotext)
-# $parsecmd = "$parser -rawdump $ARGV[0] - |";
+ $parsecmd = "$parser -raw $ARGV[0] - |";
+# to handle single-column, strangely laid out PDFs, use coalescing feature...
+# $parsecmd = "$parser $ARGV[0] - |";
         $type = "PDF";
         $dehyphenate = 1; # PDFs often have hyphenated lines
+ if (open(INFO, "$PDFINFO $ARGV[0] 2>/dev/null |")) {
+ while (<INFO>) {
+ if (/^Title:/) {
+ $title = $_;
+ $title =~ s/^Title:\s+(.*[^\s])\s*$/$1/;
+ $title =~ s/\s+/ /g;
+ $title =~ s/&/\&amp\;/g;
+ $title =~ s/</\&lt\;/g;
+ $title =~ s/>/\&gt\;/g;
+ last;
+ }
+ }
+ close INFO;
+ }
 } elsif ($magic =~ /WPC/) { # it's WordPerfect
         $parser = $CATWP;
         $parsecmd = "$parser $ARGV[0] |";
@@ -135,7 +157,8 @@
         s/\s+[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+\s+|^[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+$/ /g; # replace reading-chars with space (only at end or begin of word, but allow multiple characters)
 # s/\s[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]\s|^[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]$/ /g; # replace reading-chars with space (only at end or begin of word)
 # s/[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]/ /g; # rigorously replace all by <carl@dpiwe.tas.gov.au>
- s/[\-\255]/ /g; # replace hyphens with space
+# s/[\-\255]/ /g; # replace hyphens with space
+ s/[\255]/-/g; # replace dashes with hyphens
         @fields = split; # split up line
         next if (@fields == 0); # skip if no fields (does it speed up?)
         for ($x=0; $x<@fields; $x++) { # check each field if string length >= 3
@@ -150,15 +173,19 @@
 exit unless @allwords > 0; # nothing to output
 
 #############################################
-# print out the title
-@temp = split(/\//, $ARGV[2]); # get the filename, get rid of basename
-print "t\t$type Document $temp[-1]\n"; # print it
+# print out the title, if it's set, and not just a file name
+if ($title !~ /^$/ && $title !~ /^[A-G]:[^\s]+\.[Pp][Dd][Ff]$/) {
+ print "t\t$title\n";
+} else { # otherwise generate a title
+ @temp = split(/\//, $ARGV[2]); # get the filename, get rid of basename
+ print "t\t$type Document $temp[-1]\n"; # print it
+}
 
 
 #############################################
 # print out the head
-$head =~ s/^\s+//g;
-$head =~ s/\s+$//g;
+$head =~ s/^\s+//; # remove leading and trailing space
+$head =~ s/\s+$//;
 $head =~ s/\s+/ /g;
 $head =~ s/&/\&amp\;/g;
 $head =~ s/</\&lt\;/g;
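
The title handling in the patch above can also be sketched in isolation, which makes it easier to try out without running pdfinfo. This is a minimal sketch under a few assumptions: title_from_info and title_is_usable are hypothetical helper names (they are not in parse_doc.pl), the input is a list of lines in the "Title: ..." form that the patch expects from pdfinfo, and the sample data is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: find the first "Title:" line in pdfinfo-style
# output, tidy it up and escape it for use in the title record.
sub title_from_info {
    my @lines = @_;
    my $title = "";
    foreach my $line (@lines) {
        next unless $line =~ /^Title:\s*(.*?)\s*$/;
        $title = $1;
        $title =~ s/\s+/ /g;     # collapse runs of whitespace
        $title =~ s/&/&amp;/g;   # escape & first, then the angle brackets
        $title =~ s/</&lt;/g;
        $title =~ s/>/&gt;/g;
        last;                    # Perl leaves a loop with "last"
    }
    return $title;
}

# True if the title is non-empty and does not look like a DOS-style
# file name such as C:\docs\report.pdf (same test as in the patch).
sub title_is_usable {
    my ($title) = @_;
    return $title ne "" && $title !~ /^[A-G]:[^\s]+\.[Pp][Dd][Ff]$/;
}

# Made-up sample input standing in for real pdfinfo output.
my @sample = ("Title:    Annual  Report <draft> & notes\n", "Pages: 12\n");
my $t = title_from_info(@sample);
print title_is_usable($t) ? "t\t$t\n" : "t\tPDF Document sample.pdf\n";
```

Keeping the parsing separate from the pipe to pdfinfo means the filename check and the escaping order can be tested on plain strings.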

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



