Re: [htdig3-dev] feedback on ht://Dig documentation


From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Nov 17 1999 - 14:42:18 PST


According to Tom Metro:
> I think the way the Acrobat 4 bug work-around info is woven into the
> general information is confusing. Try something more like:

I agree. What you propose seems to be a big improvement.

> pdf_parser
> type:
> string
> used by:
> htdig
> default:
> <path>/acroread -toPostScript
> description:
> Set this to the path of the program used to parse PDF files,
> including all command-line options. The program will be
> called with the parameters:
> infile outfile,
> where infile is a file to parse and outfile is the
> PostScript output of the parser.
>
> The program is supposed to convert to a variant of
> PostScript, which is then parsed internally. Currently, only
> Adobe's acroread program has been tested as a pdf_parser.
>
> The default value of <path> is determined at compile time,
> to include the path to the acroread executable. [What if
> acroread isn't found?]

I believe that if configure doesn't find acroread, it still uses
/usr/local/bin/acroread as the default path.

> To successfully index PDF files, be sure to set the
> max_doc_size attribute to a value larger than the size of
> your largest PDF file. PDF documents cannot be parsed if
> they are truncated.
>
> Note: There is a bug in Acrobat 4's acroread command, which
> causes it to fail when -pairs is used. Ht://Dig version
> 3.??? and later include a work-around for this bug such

3.1.3

> that when acroread is the parser, and the -pairs option is
> not given, the second parameter will be the output directory
> rather than the output file name. [Does ht://Dig really
> specify a second parameter? It seems that if -pairs is
> omitted, acroread wants just one parameter, and it
> auto-generates a target file name by changing the extension
> of the input file name.]

Yes, it does. My understanding is that acroread only needs one parameter,
but if it is given more and the last one is a directory, that directory
is used as the target directory in which the .ps files are stored.
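
To make the two calling conventions concrete, here is a rough Python
sketch of the argument-building logic described above. It is not the
actual htdig source (which is C++), and the function name and the way
acroread is detected are my own assumptions for illustration.

```python
import os
import shlex

def build_parser_argv(pdf_parser, infile, outfile):
    """Return the argument list for running the PDF parser (illustrative only)."""
    # Split the pdf_parser attribute into the program and its options.
    argv = shlex.split(pdf_parser)
    # Assumption: acroread is recognised by the base name of the program.
    is_acroread = os.path.basename(argv[0]) == "acroread"
    if is_acroread and "-pairs" not in argv:
        # Work-around case: pass the output directory, not the output file.
        # os.path.dirname returns "" for a bare file name, so fall back to ".".
        second = os.path.dirname(outfile) or "."
    else:
        # Normal case: the parser is called as "infile outfile".
        second = outfile
    return argv + [infile, second]
```

In the work-around case acroread picks the .ps file name itself, so
whatever calls the parser has to look for the output under that name
in the target directory rather than under the name it asked for.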

> The pdftops program that is part of the xpdf package is not
> suitable as a pdf_parser, because its variant of PostScript
> is slightly different. However, an alternative is to use
> xpdf's pdftotext program as a component of an external
> parser with the xpdf 0.90 package installed on your system,
> as described in FAQ question 4.9.
>
> example:
> pdf_parser: /usr/local/Acrobat3/bin/acroread \
> -toPostScript -pairs
> or
> pdf_parser: /usr/local/Acrobat4/bin/acroread \
> -toPostScript
>
> How do you disable the pdf_parser? It's certainly conceivable that
> someone may have PDF files in their web, but no parser and would like
> to suppress the error messages. Setting exclude_urls is one way, but
> there's probably a better one that should be mentioned in the
> documentation above.

With 3.1.3 and later versions, the pdf_parser sort of disables itself.
If given a full pathname to acroread, htdig will try to see if the file
exists, and if not, it will only complain once and not try again. If you
don't want htdig to even attempt to index PDFs, you should add .pdf to
bad_extensions. Right now, that's about the only way. There isn't any
setting of pdf_parser that completely disables the parser. I think as
xpdf improves, the semi-builtin PDF parser will become increasingly
irrelevant, and may be removed from future versions, but for now, there
are still some people who continue to use it.
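
For example, a config fragment along these lines should keep htdig from
fetching PDFs at all. The extensions other than .pdf are only an
illustrative list, not necessarily your version's default. As far as I
know, setting bad_extensions replaces the default rather than adding to
it, so copy the existing value from your own configuration and append
.pdf to it.

```
# htdig.conf fragment (illustrative; start from your own bad_extensions value)
bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
                .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg \
                .mov .avi .pdf
```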

> FYI, I found that acroread v.4 (running on Linux), even without the
> -pairs option, choked on several PDF files that otherwise display fine
> on other platforms with the Acrobat Reader v.3 or v.4 (don't know if
> they display OK on Linux, and -toPostScript isn't supported on Win32).
> Rolling back to acroread v.3 seemed to solve this problem.

Yes, acroread 4 for Linux is definitely buggy. There are files it
displays fine, but on which it crashes when converting to PostScript.
There have been reports of similar behaviour in acroread 3, but for
the most part, 3 seems much more solid than 4.

As for Win32, I wonder if users would have more luck building xpdf's
pdftotext, and using it in an external parser.

> On this page:
> http://www.htdig.org/FAQ.html
>
> this FAQ answer appears out of date:
>
> 5.2. I can't index PDF files.
>
> As above, this usually has to do with the default document
> size. What happens is ht://Dig will read in part of a PDF
> file and try to index it. This usually fails. Try setting
> "max_doc_size" in your config file to a larger value than
> the size of your largest PDF file.
>
> Another common problem is that htdig can't find the acroread
> program, which it uses to convert PDF files to PostScript.
> The solution is to obtain and install Adobe Acrobat Reader
> 3.0, if it's available for your system. You may also need to
> set the pdf_parser attribute to the correct location and
> options for acroread. There is apparently a bug in Adobe
> Acrobat Reader version 4, in its handling of the -pairs
> option, which causes a segmentation violation when using it
> with htdig, so it is not suitable as a PDF parser. An
> alternative is to use an external parser with the xpdf 0.90
> package installed on your system, as described in question
> 4.9 above.
>
> It fails to mention that ht://Dig has a work-around for Acrobat Reader
> version 4 now.

Yes, that was an oversight on my part. I meant to fix this for 3.1.3,
but my vacation got in the way of that, and then I forgot. The text
should instead mention the general unreliability of acroread 4.
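
On the max_doc_size advice in the FAQ answer quoted above: if the PDFs
being indexed also live on a local disk, a small script can report the
size of the largest one, which gives a lower bound for max_doc_size.
This is only a sketch under that assumption. The function name is made
up, and htdig itself fetches documents over HTTP, so this applies only
when you have file-system access to the same documents.

```python
import os

def largest_pdf_size(root):
    """Return the size in bytes of the largest .pdf file under root (0 if none)."""
    largest = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Match the extension case-insensitively (.pdf, .PDF, ...).
            if name.lower().endswith(".pdf"):
                size = os.path.getsize(os.path.join(dirpath, name))
                largest = max(largest, size)
    return largest
```

Set max_doc_size to something comfortably above the number this
returns, since a truncated PDF can't be parsed.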

> Also, question 4.9 should probably reference this
> question.

Not a bad idea.

> And this answer:
>
> 5.11. When I run htsearch, it gives me a count of matches,
> but doesn't list the matching documents.
>
> This is usually an indication of a corrupted database. If
> it's finding matches, it's because it found the matching
> words in db.words.db. However, it isn't finding the document
> records themselves in db.docdb, which would suggest that
> either db.docdb, or db.docs.index (which maps document IDs
> used in db.words.db to URLs used to look up records in
> db.docdb), is messed up. You'll likely need to rebuild your
> database from scratch. Older versions of ht://Dig were
> susceptible to database corruption of this sort. Versions
> 3.1.2 and later are much more stable.
>
> should mention that a currently running database rebuild (running
> rundig) apparently can cause the same symptom. And this, I would
> think, is a more likely situation for a user to encounter.

Very good point, and one I totally overlooked when writing this. I'll
get the FAQ updates in as soon as I can. In the meantime, if you can
edit in your changes to htdig-3.1.3's htdoc/attrs.html, and send in your
changes as a patch file, that would be a big help. I'll be committing
several patches to the 3.1.x source tree, hopefully by next week, for
an upcoming 3.1.4 maintenance release.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

