Subject: [htdig3-dev] feedback on ht://Dig documentation
From: Tom Metro (tmetro@vl.com)
Date: Wed Nov 17 1999 - 13:14:26 PST
On this page:
http://www.htdig.org/attrs.html#pdf_parser
it says:
pdf_parser
type:
string
used by:
htdig
default:
acroread -toPostScript
description:
Set this to the path of the program used to parse
PDF files, including all command-line options. The
program will be called with the parameters:
infile outfile,
where infile is a file to parse and outfile is the
PostScript output of the parser. In the case where
acroread is the parser, and the -pairs option is not
given, the second parameter will be the output
directory rather than the output file name. The
program is supposed to convert to a variant of
PostScript, which is then parsed internally.
Currently, only Adobe's acroread program has been
tested as a pdf_parser. There is a bug in Acrobat
4's acroread command, which causes it to fail when
-pairs is used, hence the special case above.
The pdftops program that is part of the xpdf
package is not suitable as a pdf_parser, because its
variant of PostScript is slightly different. However,
an alternative is to use xpdf's pdftotext program as a
component of an external parser with the xpdf 0.90
package installed on your system, as described in
FAQ question 4.9.
In either case, to successfully index PDF files, be
sure to set the max_doc_size attribute to a value
larger than the size of your largest PDF file. PDF
documents can not be parsed if they are truncated.
The default value of this attribute is determined at
compile time, to include the path to the acroread
executable.
example:
pdf_parser: /usr/local/Acrobat3/bin/acroread
-toPostScript -pairs
I think the way the Acrobat 4 bug work-around info is woven into the
general information is confusing. Try something more like:
pdf_parser
type:
string
used by:
htdig
default:
<path>/acroread -toPostScript
description:
Set this to the path of the program used to parse PDF files,
including all command-line options. The program will be
called with the parameters:
infile outfile,
where infile is a file to parse and outfile is the
PostScript output of the parser.
The program is supposed to convert to a variant of
PostScript, which is then parsed internally. Currently, only
Adobe's acroread program has been tested as a pdf_parser.
The default value of <path> is determined at compile time,
to include the path to the acroread executable. [What if
acroread isn't found?]
To successfully index PDF files, be sure to set the
max_doc_size attribute to a value larger than the size of
your largest PDF file. PDF documents can not be parsed if
they are truncated.
Note: There is a bug in Acrobat 4's acroread command, which
causes it to fail when -pairs is used. Ht://Dig version
3.??? and later include a work-around for this bug such
that when acroread is the parser, and the -pairs option is
not given, the second parameter will be the output directory
rather than the output file name. [Does ht://Dig really
specify a second parameter? It seems that if -pairs is
omitted, acroread wants just one parameter, and it
auto-generates a target file name by changing the extension
of the input file name.]
The pdftops program that is part of the xpdf package is not
suitable as a pdf_parser, because its variant of PostScript
is slightly different. However, an alternative is to use
xpdf's pdftotext program as a component of an external
parser with the xpdf 0.90 package installed on your system,
as described in FAQ question 4.9.
example:
pdf_parser: /usr/local/Acrobat3/bin/acroread \
-toPostScript -pairs
or
pdf_parser: /usr/local/Acrobat4/bin/acroread \
-toPostScript
How do you disable the pdf_parser? It's certainly conceivable that
someone may have PDF files on their web site but no parser, and would
like to suppress the error messages. Setting exclude_urls is one way, but
there's probably a better one that should be mentioned in the
documentation above.
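
For illustration, the exclude_urls approach might look like the sketch
below. It assumes exclude_urls rejects any URL containing one of the
listed substrings, as the attribute documentation describes, and that
"/cgi-bin/ .cgi" is already in your exclude list; adjust to match your
own config.

```
# Sketch only: skip any URL containing ".pdf" so htdig never tries to
# hand it to pdf_parser. The first two patterns are assumed to be the
# ones already present in your configuration.
exclude_urls: /cgi-bin/ .cgi .pdf
```

This keeps PDF files out of the index entirely rather than disabling
the parser itself, so it is a stopgap, not an answer to the question
above.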
FYI, I found that acroread v.4 (running on Linux), even without the
-pairs option, choked on several PDF files that display fine on other
platforms with Acrobat Reader v.3 or v.4. (I don't know whether they
display OK on Linux, and -toPostScript isn't supported on Win32.)
Rolling back to acroread v.3 seemed to solve this problem.
On this page:
http://www.htdig.org/FAQ.html
this FAQ answer appears out of date:
5.2. I can't index PDF files.
As above, this usually has to do with the default document
size. What happens is ht://Dig will read in part of a PDF
file and try to index it. This usually fails. Try setting
"max_doc_size" in your config file to a larger value than
the size of your largest PDF file.
Another common problem is that htdig can't find the acroread
program, which it uses to convert PDF files to PostScript.
The solution is to obtain and install Adobe Acrobat Reader
3.0, if it's available for your system. You may also need to
set the pdf_parser attribute to the correct location and
options for acroread. There is apparently a bug in Adobe
Acrobat Reader version 4, in its handling of the -pairs
option, which causes a segmentation violation when using it
with htdig, so it is not suitable as a PDF parser. An
alternative is to use an external parser with the xpdf 0.90
package installed on your system, as described in question
4.9 above.
It fails to mention that ht://Dig has a work-around for Acrobat Reader
version 4 now. Also, question 4.9 should probably reference this
question.
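
As a concrete illustration of the max_doc_size advice in that answer,
a config entry might look like the following. The number is purely
hypothetical; it assumes max_doc_size is given in bytes and that no PDF
on the site is larger than about 5 MB.

```
# Hypothetical value: must exceed the size of your largest PDF file,
# since truncated PDF documents cannot be parsed.
max_doc_size: 5000000
```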
And this answer:
5.11. When I run htsearch, it gives me a count of matches,
but doesn't list the matching documents.
This is usually an indication of a corrupted database. If
it's finding matches, it's because it found the matching
words in db.words.db. However, it isn't finding the document
records themselves in db.docdb, which would suggest that
either db.docdb, or db.docs.index (which maps document IDs
used in db.words.db to URLs used to look up records in
db.docdb), is messed up. You'll likely need to rebuild your
database from scratch. Older versions of ht://Dig were
susceptible to database corruption of this sort. Versions
3.1.2 and later are much more stable.
should mention that a database rebuild in progress (rundig currently
running) can apparently cause the same symptom. I would think this is
a more likely situation for a user to encounter.
-Tom
--
Tom Metro
Venture Logic
tmetro@vl.com
Newton, MA, USA

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
htdig3-dev-unsubscribe@htdig.org
You'll receive a message confirming the unsubscription.