Re: [htdig] PDF indexing


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Fri, 13 Aug 1999 16:25:25 -0500 (CDT)


According to burditt@okstate.edu:
> It says my PDF files are encryped and xpdf doesnt do encryped files.
>
> Do I have any alternatives to get it to include the PDF files in the
> search?

There are a few alternatives. There are decryption patches available
for xpdf. There should be a link to this on the foolabs.com web site.
That may do the trick for you.

Another option would be to install Acrobat 3, and use it to index your
PDFs.

A third option - and this is the one I'd like you to try first -
is to apply the following patch to htdig 3.1.2, to get it to work
with Acrobat 4. You'll also have to remove the -pairs option from any
pdf_parser definition you've added to your htdig.conf file, if you defined
this attribute. I've made the change to the 3.2 development source,
but haven't had a chance to test it out yet. (Getting the current 3.2
development source to compile and install successfully on my system
is proving to be a major ordeal.) I'd appreciate any feedback as to
whether this works.

In the htdig-3.1.2 source directory, "patch -p < {this e-mail message}":
---------------------
--- htdig/PDF.cc.orig Tue Mar 23 17:17:33 1999
+++ htdig/PDF.cc Fri Aug 13 16:05:16 1999
@@ -104,13 +104,22 @@ PDF::parse(Retriever &retriever, URL &ur
         acroread = "acroread";
 
     // Check for existance of acroread program! (if not, return)
- //struct stat stat_buf;
- // Check that it exists, and is a regular file.
- //if ((stat(acroread, &stat_buf) == -1) || !S_ISREG(stat_buf.st_mode))
- // {
- // printf("PDF::parse: cannot find acroread\n");
- // return;
- // }
+ struct stat stat_buf;
+ static int notfound = 0;
+ if (notfound) // we only need to complain once
+ return;
+ String arg0 = acroread;
+ char *endarg = strchr(acroread.get(), ' ');
+ if (endarg)
+ *endarg = '\0';
+ // If first arg is a path, check that it exists, and is a regular file.
+ if (strchr(arg0.get(), '/') &&
+ ((stat(arg0.get(), &stat_buf) == -1) || !S_ISREG(stat_buf.st_mode)))
+ {
+ printf("PDF::parse: cannot find pdf parser %s\n", arg0.get());
+ notfound = 1;
+ return;
+ }
 
     // Write the pdf contents in a temp file to give it to acroread
 
@@ -140,9 +149,19 @@ PDF::parse(Retriever &retriever, URL &ur
 
 
     // Use acroread as a filter to convert to PostScript.
- // Now generalized to allow xpdf as a parser (works with most recent xpdf)
+ // Now generalized to allow xpdf as a parser, or other compatible parsers
+ // (It was claimed it works with most recent xpdf, but it doesn't!)
     // acroread << " -toPostScript " << pdfName << " " << tmpdir << " 2>&1";
- acroread << " " << pdfName << " " << psName << " 2>&1";
+ String dest = psName;
+ if (strstr(acroread.get(), "acroread"))
+ {
+ // special-case tests only for acroread (what else you gonna use?)
+ if (!strstr(acroread.get(), "-toPostScript"))
+ acroread << " -toPostScript "; // add missing option
+ if (!strstr(acroread.get(), "-pairs")) // don't use -pairs with 4.0
+ dest = tmpdir;
+ }
+ acroread << " " << pdfName << " " << dest << " 2>&1";
 
     if (system(acroread))
     {
--- htcommon/defaults.cc.orig Thu Mar 25 11:49:40 1999
+++ htcommon/defaults.cc Fri Aug 13 16:05:16 1999
@@ -21,7 +21,7 @@ ConfigDefaults defaults[] =
     {"database_dir", DATABASE_DIR},
     {"bin_dir", BIN_DIR},
     {"image_url_prefix", IMAGE_URL_PREFIX},
- {"pdf_parser", PDF_PARSER " -toPostScript -pairs"},
+ {"pdf_parser", PDF_PARSER " -toPostScript"},
     {"version", VERSION},
 
     //
--- htdoc/attrs.html.orig Thu Mar 25 11:35:03 1999
+++ htdoc/attrs.html Fri Aug 13 16:08:20 1999
@@ -4025,7 +4025,7 @@
                         <em>default:</em>
                   </dt>
                   <dd>
- acroread -toPostScript -pairs
+ acroread -toPostScript
                   </dd>
                   <dt>
                         <em>description:</em>
@@ -4037,20 +4037,44 @@
                       <em>infile outfile</em>,<br>
                       where <em>infile</em> is a file to parse and
                       <em>outfile</em> is the PostScript output of the
- parser. The program is supposed to convert to a
+ parser. In the case where acroread is the parser, and
+ the -pairs option is not given, the second parameter
+ will be the output directory rather than the output
+ file name. The program is supposed to convert to a
                       variant of PostScript, which is then parsed
- internally. Currently, Adobe's <a
+ internally. Currently, only Adobe's <a
                       href="http://www.adobe.com/prodindex/acrobat/readstep.html">
- acroread</a> program and the pdftops program
- that is part of the <a
+ acroread</a> program has been tested as a pdf_parser.
+ There is a bug in Acrobat 4's acroread command, which
+ causes it to fail when -pairs is used, hence the special
+ case above.<br>
+ The pdftops program that is part of the <a
                       href="http://www.foolabs.com/xpdf/">xpdf</a>
- 0.80 package have been tested as pdf_parsers.
+ package is not suitable as a pdf_parser,
+ because its variant of PostScript is slightly
+ different. However, an alternative is to
+ use xpdf's pdftotext program as a component
+ of an <a href="#external_parsers">external
+ parser</a> with the xpdf 0.90 package installed
+ on your system, as described in FAQ question <a
+ href="FAQ.html#q4.9">4.9</a>.<br>
+ In either case, to successfully index PDF files,
+ be sure to set the <a
+ href="#max_doc_size">max_doc_size</a> attribute
+ to a value larger than the size of your largest
+ PDF file. PDF documents can not be parsed if they
+ are truncated.
+ <p>
+ The default value of this attribute is determined at
+ compile time, to include the path to the acroread
+ executable.
+ </p>
                   </dd>
                   <dt>
                         <em>example:</em>
                   </dt>
                   <dd>
- pdf_parser: /usr/local/bin/acroread -toPostScript -pairs
+ pdf_parser: /usr/local/Acrobat3/bin/acroread -toPostScript -pairs
                   </dd>
                 </dl>
           </dd>
---------------------

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

------------------------------------ To unsubscribe from the htdig mailing list, send a message to htdig@htdig.org containing the single word unsubscribe in the SUBJECT of the message.



This archive was generated by hypermail 2.0b3 on Fri Aug 13 1999 - 14:26:21 PDT