[htdig] Correction to patch for Acrobat 4


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 18 Aug 1999 11:18:05 -0500 (CDT)


Hi again, folks. I made a silly mistake in my patch last Friday, August
13, to support Acrobat 4. Here's the fix for that mistake:

--- htdig/PDF.cc.bug Tue Aug 17 11:07:17 1999
+++ htdig/PDF.cc Wed Aug 18 09:22:28 1999
@@ -109,7 +109,7 @@ PDF::parse(Retriever &retriever, URL &ur
     if (notfound) // we only need to complain once
         return;
     String arg0 = acroread;
-    char *endarg = strchr(acroread.get(), ' ');
+    char *endarg = strchr(arg0.get(), ' ');
     if (endarg)
         *endarg = '\0';
     // If first arg is a path, check that it exists, and is a regular file.

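To illustrate why the one-word change matters: judging from the hunk
above, the old line cut the acroread string itself at the first space
instead of cutting the arg0 copy, so arg0 kept the options and the
configured command lost them. The sketch below shows the corrected
pattern using std::string rather than ht://Dig's own String class. It
assumes that copying a string gives an independent buffer. The names
first_arg and demo are made up for illustration and are not part of
PDF.cc.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper (not in PDF.cc): return the first word of a
// command line, leaving the caller's string untouched.
std::string first_arg(const std::string &command)
{
    std::string arg0 = command;                 // private copy, like "String arg0 = acroread;"
    char *endarg = std::strchr(&arg0[0], ' ');  // search the copy, not the original
    if (endarg)
        *endarg = '\0';                         // cut the copy at the first space
    return std::string(arg0.c_str());           // keep only the text before the NUL
}

// Hypothetical usage check: the program name is extracted and the
// original command line, options included, is left intact.
void demo()
{
    const std::string acroread = "acroread -toPostScript -pairs";
    assert(first_arg(acroread) == "acroread");
    assert(acroread == "acroread -toPostScript -pairs");
}
```

Searching the original string instead of the copy would write the
terminating NUL into the wrong buffer, which is the mistake the patch
above corrects.
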
It turns out that even without the -pairs option, acroread 4 is still
prone to segmentation violations when generating PostScript, so acroread 3
is the better choice anyway. However, this fix does take care of a few
other problems in the pdf_parser handling, and you may find that Acrobat 4
works OK with your files. Hopefully Adobe will fix acroread's problems
before too long.

Also, if you applied last Friday's patch after applying the patch file
collection I sent out last Monday, August 9, there's a hunk that would
have failed to apply to htdoc/attrs.html, because of a conflicting
change in the patch file collection. You can correct that by applying
the patch below (as well as the one above) after Friday's patch.

--- htdig-3.1.2/htdoc/attrs.html.orig Fri Aug 6 14:00:28 1999
+++ htdig-3.1.2/htdoc/attrs.html Tue Aug 17 10:55:45 1999
@@ -4283,14 +4283,33 @@
                       <em>infile outfile</em>,<br>
                       where <em>infile</em> is a file to parse and
                       <em>outfile</em> is the PostScript output of the
- parser. The program is supposed to convert to a
+ parser. In the case where acroread is the parser, and
+ the -pairs option is not given, the second parameter
+ will be the output directory rather than the output
+ file name. The program is supposed to convert to a
                       variant of PostScript, which is then parsed
- internally. Currently, Adobe's <a
+ internally. Currently, only Adobe's <a
                       href="http://www.adobe.com/prodindex/acrobat/readstep.html">
- acroread</a> program and the pdftops program
- that is part of the <a
+ acroread</a> program has been tested as a pdf_parser.
+ There is a bug in Acrobat 4's acroread command, which
+ causes it to fail when -pairs is used, hence the special
+ case above.<br>
+ The pdftops program that is part of the <a
                       href="http://www.foolabs.com/xpdf/">xpdf</a>
- 0.80 package have been tested as pdf_parsers.
+ package is not suitable as a pdf_parser,
+ because its variant of PostScript is slightly
+ different. However, an alternative is to
+ use xpdf's pdftotext program as a component
+ of an <a href="#external_parsers">external
+ parser</a> with the xpdf 0.90 package installed
+ on your system, as described in FAQ question <a
+ href="FAQ.html#q4.9">4.9</a>.<br>
+ In either case, to successfully index PDF files,
+ be sure to set the <a
+ href="#max_doc_size">max_doc_size</a> attribute
+ to a value larger than the size of your largest
+ PDF file. PDF documents can not be parsed if they
+ are truncated.
                         <p>
                           The default value of this attribute is determined at
                           compile time, to include the path to the acroread

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



