[htdig] Re: parse_doc.pl + pdftotext = El Perfecto -.0001:*)


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Mon, 1 Mar 1999 12:06:21 -0600 (CST)


According to Joe R. Jah:
> El Perfecto:
>
> Thank you very much for your giant leap for PDF kind;)
>
> I applied your second patch to parse_doc.pl and Derek's fix to
> xpdf/TextOutputDev.cc; now all the PDF files in my search path are indexed
> using the external parser directive in the config file:
>
> external_parsers: application/msword /usr/local/bin/parse_doc.pl \
> application/postscript /usr/local/bin/parse_doc.pl \
> application/pdf /usr/local/bin/parse_doc.pl

Thanks, but most of the credit belongs to Derek Noonburg, author of
pdftotext, who made this possible. I've since discovered another
problem with using pdftotext for indexing PDFs. It's too clever in
handling multi-column PDF documents - it spits out the plain text in
a multi-column format, instead of "unravelling" the columns. This
makes for odd-looking document excerpts. I'm going to see if I can
find a way around this, and maybe ask Derek for assistance again.

> -.0001:
>
> One crappy PDF file creates a score of errors during the dig:
>
> External parser error in line:w^@(Garbage)*
>
> It also appears in the search results as:
>
> Word Document prereg.pdf
>
> instead of
>
> PDF Document prereg.pdf

Ick! The problem is that parse_doc.pl started out as parse_word_doc.pl,
and was meant for MS Word documents only. As I added other file types,
I kept MS Word as the default. So anything it doesn't recognise with
its simple magic number or magic string tests gets passed on to catdoc,
which tends to barf on anything that isn't really an MS Word document.
The default case has to go, and an explicit test for Word document
magic numbers is needed.
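
To illustrate what "no default case" means, here is a minimal sketch.
It is not the actual parse_doc.pl code: the function name classify_magic
is made up for this example, and the patterns are simply the ones the
script already tests against the first few bytes of the file. Anything
unrecognised comes back undefined, so the caller can skip the file
instead of handing it to catdoc.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of parse_doc.pl): map the leading bytes
# of a file to a type name.  Returns undef when no signature matches,
# rather than falling back to "Word".
sub classify_magic {
    my ($magic) = @_;
    return "PostScript"  if $magic =~ /%!/;
    return "PDF"         if $magic =~ /%PDF-/;
    return "WordPerfect" if $magic =~ /WPC/;
    return "RTF"         if $magic =~ /^\{\\rtf/;
    return "Word"        if $magic =~ /\320\317\021\340/;  # MS Word signature
    return;                                                # unknown type
}

# Example use: refuse files we cannot identify.
my $type = classify_magic("%PDF-1.2");
print defined $type ? "type is $type\n" : "unknown type, skipping\n";
```

A wrapped file (MacBinary, HP print job) would still come back as
unknown here, which is at least better than a bogus "Word Document"
entry in the search results.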

> The file is:
>
> http://www.ccsf.cc.ca.us/Resources/Title3/training/prereg.pdf
>
> It can be searched with:
>
> http://www.ccsf.cc.ca.us/cgi-bin/htsearch?config=htdig&restrict=\
> &exclude=&words=pre-registration+form&method=and&format=builtin-short
>
> No other word in that file gives a search result, I guess the error had
> happened at the top of the file after the line Pre-Registration Form.

Or that was the only usable text string that catdoc managed to extract
from the PDF file.

The problem is your prereg.pdf file has a MacBinary header on it. I had
a similar problem with a PostScript file that had an HP print job wrapper
on it - even though ghostscript didn't have a problem with the wrapper,
parse_doc.pl did, so it passed the file to catdoc instead of ps2ascii.

Here's a patch to parse_doc.pl, which must be applied after my earlier
(Feb. 25?) patch for PDF support. It tests for MacBinary wrappers,
which pdftotext seems to know how to skip. (pdftotext just looks for
%PDF- in the first 1K block of the file - more than enough to skip the
128 byte MacBinary header.) My patch also allows HP job wrappers on
PS files, and explicitly tests for MS Word documents' magic number.
You can always grab my latest version of parse_doc.pl from

        http://www.scrc.umanitoba.ca/htdig/rpms/parse_doc.pl

or use this patch.

--- parse_doc.pl.docdef Thu Feb 25 15:16:43 1999
+++ parse_doc.pl Mon Mar 1 10:54:23 1999
@@ -19,6 +19,9 @@
 # 1999/02/25
 # Added: uses pdftotext to handle PDF files <grdetil@scrc.umanitoba.ca>
 # Changed: generates a head record with punct. <grdetil@scrc.umanitoba.ca>
+# 1999/03/01
+# Added: extra checks for file "wrappers" <grdetil@scrc.umanitoba.ca>
+# & check for MS Word signature (no longer defaults to catdoc)
 #########################################
 #
 # set this to your MS Word to text converter
@@ -65,26 +68,40 @@
 read FILE,$magic,8;
 close FILE;
 
-if ($magic =~ /%!/) { # it's PostScript
+if ($magic =~ /^\0\n/) { # possible MacBinary header
+ open(FILE, "< $ARGV[0]") || die "Oops. Can't open file $ARGV[0]: $!\n";
+ read FILE,$magic,136; # let's hope parsers can handle them!
+ close FILE;
+}
+
+if ($magic =~ /%!|^\033%-12345/) { # it's PostScript (or HP print job)
         $parser = $CATPS; # gs 3.33 leaves _temp_.??? files in .
         $parsecmd = "(cd /tmp; $parser; rm -f _temp_.???) < $ARGV[0] |";
         $type = "PostScript";
+ if ($magic =~ /^\033%-12345/) { # HP print job
+ open(FILE, "< $ARGV[0]") || die "Oops. Can't open file $ARGV[0]: $!\n";
+ read FILE,$magic,256;
+ close FILE;
+ exit unless $magic =~ /^\033%-12345X\@PJL.*\n*.*\n*.*ENTER LANGUAGE = POSTSCRIPT.*\n*.*\n*.*\n%!/
+ }
 } elsif ($magic =~ /%PDF-/) { # it's PDF (Acrobat)
         $parser = $CATPDF;
         $parsecmd = "$parser $ARGV[0] - |";
         $type = "PDF";
-} elsif ($magic =~ /WPC/) { # it's WordPerfect
+} elsif ($magic =~ /WPC/) { # it's WordPerfect
         $parser = $CATWP;
         $parsecmd = "$parser $ARGV[0] |";
         $type = "WordPerfect";
-} elsif ($magic =~ /^{\\rtf/) { # it's Richtext
+} elsif ($magic =~ /^{\\rtf/) { # it's Richtext
         $parser = $CATRTF;
         $parsecmd = "$parser $ARGV[0] |";
         $type = "RTF";
-} else { # assume it's MS Word
+} elsif ($magic =~ /\320\317\021\340/) { # it's MS Word
         $parser = $CATDOC;
         $parsecmd = "$parser -a -w $ARGV[0] |";
         $type = "Word";
+} else {
+ die "Can't determine type of file $ARGV[0]\n";
 }
 # print STDERR "$ARGV[0]: $type $parsecmd\n";
 die "Hmm. $parser is absent or unwilling to execute.\n" unless -x $parser;
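
For anyone who wants to experiment with the wrapper checks outside of
parse_doc.pl, here is a rough standalone sketch. Both function names
are made up for this example. The MacBinary test rests on my reading
of the header layout (byte 0 is zero, byte 1 is the file name length
in the range 1 to 63, bytes 74 and 82 are zero), so treat that as an
assumption to verify. If that layout is right, the /^\0\n/ test in the
patch only matches 10-character names such as prereg.pdf, and a looser
check like this one may be worth trying. The second function mimics
what pdftotext is described as doing above: looking for %PDF- within
the first 1K of the file.

```perl
use strict;
use warnings;

# Hypothetical helper: does this buffer start with something that looks
# like a 128-byte MacBinary header?  The layout is an assumption:
# byte 0 == 0, byte 1 == file name length (1..63), bytes 74 and 82 == 0.
sub looks_like_macbinary {
    my ($buf) = @_;
    return 0 if length($buf) < 128;
    my ($version, $namelen) = unpack("C C", $buf);
    return 0 unless $version == 0 && $namelen >= 1 && $namelen <= 63;
    return 0 unless ord(substr($buf, 74, 1)) == 0
                 && ord(substr($buf, 82, 1)) == 0;
    return 1;
}

# Hypothetical helper: offset of "%PDF-" within the first 1K of the
# buffer, or -1 if it does not appear there.
sub pdf_offset {
    my ($buf) = @_;
    return index(substr($buf, 0, 1024), "%PDF-");
}
```

With a buffer read from the start of the file, a PDF behind a
MacBinary header would give looks_like_macbinary() true and a
pdf_offset() of 128, while a plain PDF would give an offset of 0.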

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930