[htdig3-dev] feedback on ht://Dig documentation


From: Tom Metro (tmetro@vl.com)
Date: Wed Nov 17 1999 - 13:14:26 PST


On this page:
http://www.htdig.org/attrs.html#pdf_parser

it says:

  pdf_parser
     type:
          string
     used by:
          htdig
     default:
          acroread -toPostScript
     description:
          Set this to the path of the program used to parse
          PDF files, including all command-line options. The
          program will be called with the parameters:
          infile outfile,
          where infile is a file to parse and outfile is the
          PostScript output of the parser. In the case where
          acroread is the parser, and the -pairs option is not
          given, the second parameter will be the output
          directory rather than the output file name. The
          program is supposed to convert to a variant of
          PostScript, which is then parsed internally.
          Currently, only Adobe's acroread program has been
          tested as a pdf_parser. There is a bug in Acrobat
          4's acroread command, which causes it to fail when
          -pairs is used, hence the special case above.
          The pdftops program that is part of the xpdf
          package is not suitable as a pdf_parser, because its
          variant of PostScript is slightly different. However,
          an alternative is to use xpdf's pdftotext program as a
          component of an external parser with the xpdf 0.90
          package installed on your system, as described in
          FAQ question 4.9.
          In either case, to successfully index PDF files, be
          sure to set the max_doc_size attribute to a value
          larger than the size of your largest PDF file. PDF
          documents can not be parsed if they are truncated.

          The default value of this attribute is determined at
          compile time, to include the path to the acroread
          executable.

     example:
          pdf_parser: /usr/local/Acrobat3/bin/acroread
          -toPostScript -pairs

I think the way the Acrobat 4 bug work-around info is woven into the
general information is confusing. Try something more like:

  pdf_parser
     type:
          string
     used by:
          htdig
     default:
          <path>/acroread -toPostScript
     description:
        Set this to the path of the program used to parse PDF files,
        including all command-line options. The program will be
        called with the parameters:
                  infile outfile,
        where infile is a file to parse and outfile is the
        PostScript output of the parser.

        The program is supposed to convert to a variant of
        PostScript, which is then parsed internally. Currently, only
        Adobe's acroread program has been tested as a pdf_parser.

        The default value of <path> is determined at compile time,
        to include the path to the acroread executable. [What if
        acroread isn't found?]

        To successfully index PDF files, be sure to set the
        max_doc_size attribute to a value larger than the size of
        your largest PDF file. PDF documents can not be parsed if
        they are truncated.

        Note: There is a bug in Acrobat 4's acroread command, which
        causes it to fail when -pairs is used. Ht://Dig version
        3.??? and later include a work-around for this bug such
        that when acroread is the parser, and the -pairs option is
        not given, the second parameter will be the output directory
        rather than the output file name. [Does ht://Dig really
        specify a second parameter? It seems that if -pairs is
        omitted, acroread wants just one parameter, and it
        auto-generates a target file name by changing the extension
        of the input file name.]

        The pdftops program that is part of the xpdf package is not
        suitable as a pdf_parser, because its variant of PostScript
        is slightly different. However, an alternative is to use
        xpdf's pdftotext program as a component of an external
        parser with the xpdf 0.90 package installed on your system,
        as described in FAQ question 4.9.

     example:
          pdf_parser: /usr/local/Acrobat3/bin/acroread \
          -toPostScript -pairs
      or
          pdf_parser: /usr/local/Acrobat4/bin/acroread \
          -toPostScript

How do you disable the pdf_parser? It's certainly conceivable that
someone has PDF files on their web site but no parser, and would like
to suppress the error messages. Setting exclude_urls is one way, but
there's probably a better one, and it should be mentioned in the
documentation above.

FYI, I found that acroread v.4 (running on Linux), even without the
-pairs option, choked on several PDF files that display fine on other
platforms with Acrobat Reader v.3 or v.4. (I don't know whether they
display correctly on Linux, and -toPostScript isn't supported on
Win32.) Rolling back to acroread v.3 seemed to solve the problem.
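
One way to narrow down which files a given converter chokes on is to
run it over each PDF in turn and collect the ones that fail. The
following is only a sketch: the function name find_failing_pdfs is
hypothetical, and it assumes the converter signals failure through a
non-zero exit status, a crash, or a hang, which may not hold for every
acroread version.

```python
import subprocess


def find_failing_pdfs(command, pdf_paths, timeout=60):
    """Run `command + [pdf]` once per file and return the paths that failed.

    A file counts as failed if the command exits with a non-zero status
    (a crash from a signal shows up as a negative status), cannot be
    started at all, or does not finish within `timeout` seconds.
    """
    failures = []
    for pdf in pdf_paths:
        try:
            result = subprocess.run(
                list(command) + [str(pdf)],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            ok = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            # Command missing or not executable, or the converter hung.
            ok = False
        if not ok:
            failures.append(str(pdf))
    return failures
```

For example, find_failing_pdfs(["acroread", "-toPostScript"], paths)
would try each file with the command line discussed above. Running the
same list through two acroread versions and comparing the results
shows which files only the newer version rejects.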

On this page:
http://www.htdig.org/FAQ.html

this FAQ answer appears out of date:

        5.2. I can't index PDF files.

        As above, this usually has to do with the default document
        size. What happens is ht://Dig will read in part of a PDF
        file and try to index it. This usually fails. Try setting
        "max_doc_size" in your config file to a larger value than
        the size of your largest PDF file.

        Another common problem is that htdig can't find the acroread
        program, which it uses to convert PDF files to PostScript.
        The solution is to obtain and install Adobe Acrobat Reader
        3.0, if it's available for your system. You may also need to
        set the pdf_parser attribute to the correct location and
        options for acroread. There is apparently a bug in Adobe
        Acrobat Reader version 4, in its handling of the -pairs
        option, which causes a segmentation violation when using it
        with htdig, so it is not suitable as a PDF parser. An
        alternative is to use an external parser with the xpdf 0.90
        package installed on your system, as described in question
        4.9 above.

It fails to mention that ht://Dig has a work-around for Acrobat Reader
version 4 now. Also, question 4.9 should probably reference this
question.
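
The max_doc_size advice in the answer quoted above depends on knowing
the size of the largest PDF file. A minimal sketch for finding it
follows, assuming the documents are also reachable as files on the
local disk under a single directory. The function names and the 25%
headroom are illustrative choices, not anything ht://Dig provides.

```python
import os


def largest_pdf_size(root):
    """Return the size in bytes of the largest .pdf file under root (0 if none)."""
    largest = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Match the extension case-insensitively (.pdf, .PDF, ...).
            if name.lower().endswith(".pdf"):
                size = os.path.getsize(os.path.join(dirpath, name))
                largest = max(largest, size)
    return largest


def suggested_max_doc_size(root, headroom=1.25):
    """Largest PDF size plus some headroom, as a whole number of bytes."""
    return int(largest_pdf_size(root) * headroom) + 1
```

The number returned by suggested_max_doc_size would then go on the
max_doc_size line of the config file. The headroom only leaves room
for files that grow slightly; it does not account for larger PDFs
added later.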

And this answer:

        5.11. When I run htsearch, it gives me a count of matches,
        but doesn't list the matching documents.

        This is usually an indication of a corrupted database. If
        it's finding matches, it's because it found the matching
        words in db.words.db. However, it isn't finding the document
        records themselves in db.docdb, which would suggest that
        either db.docdb, or db.docs.index (which maps document IDs
        used in db.words.db to URLs used to look up records in
        db.docdb), is messed up. You'll likely need to rebuild your
        database from scratch. Older versions of ht://Dig were
        susceptible to database corruption of this sort. Versions
        3.1.2 and later are much more stable.

should mention that a database rebuild in progress (a running rundig)
can apparently cause the same symptom. I would think this is the more
likely situation for a user to encounter.

 -Tom

-- 
Tom Metro
Venture Logic                                     tmetro@vl.com
Newton, MA, USA




This archive was generated by hypermail 2b25 : Wed Nov 17 1999 - 13:33:05 PST