Re: htdig: PDF:parse & max_doc_size


Anthony Peacock (a.peacock@chime.ucl.ac.uk)
Thu, 26 Nov 1998 11:26:03 +0000


>
> I just discovered that max_doc_size is different from max_head_length.
> Furthermore the default for max_doc_size is 100K (defaults.cc). This is
> fine except when indexing large PDF files. The problem is that the
> error message is not correct. I got many errors like this (with only
> one -v):
>
> /tmp/htdig29740.pdf: Could not repair file.
> /tmp/htdig29740.pdf: Could not repair file.
> /tmp/htdig29740.pdf: Could not repair file.
> /tmp/htdig29740.pdf: Could not repair file.
> /tmp/htdig29740.pdf: Could not repair file.
> PDF::parse: cannot open acroread output
> PDF::parse: cannot open acroread output
> PDF::parse: cannot open acroread output
> PDF::parse: cannot open acroread output
> PDF::parse: cannot open acroread output
>
> Repeated many times...
>
> 'Could not repair file' made me think that there was a problem with some
> of my pdf files or with my acroread program. However, the error should
> have said something like this:
>
> Document.cc: /tmp/htdig29740.ps: file is too large

<SNIP>

Hi,

I have just had to solve this problem myself. As I understand it, the 'Could
not repair file' message is coming from Acroread, not from htdig. The sequence
of events is something like this (someone correct me if I am wrong:-):

1 htdig copies no more than max_doc_size bytes of the .pdf file to a temporary
    file

2 htdig then fires up acroread and passes the temporary file name

3 If the original file was bigger than max_doc_size bytes, the copy is
   truncated, so Acroread encounters an unexpected EOF and assumes that the
   .pdf file is corrupt, hence the 'Could not repair file' error

4 Acroread returns with an error

5 htdig finds no acroread output to read and reports the 'PDF::parse: cannot
   open acroread output' error
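
To make steps 1 and 3 concrete, here is a rough Python sketch of what I think is going on. It is not htdig's actual code: the function names and the fake PDF data are made up for illustration, and I have assumed that the "100K" default means 100000 bytes. It only shows that a size-capped copy of a large PDF loses the %%EOF marker that a complete PDF file ends with, which is the sort of damage a PDF reader would complain about.

```python
import io


def copy_truncated(src, dst, max_doc_size):
    """Copy at most max_doc_size bytes from src to dst; return bytes written."""
    data = src.read(max_doc_size)
    dst.write(data)
    return len(data)


def has_pdf_trailer(data):
    """A complete PDF carries an %%EOF marker near the end of the file."""
    return b"%%EOF" in data[-1024:]


# Hypothetical stand-in for a large PDF: header, filler body, end marker.
fake_pdf = b"%PDF-1.2\n" + b"x" * 200_000 + b"\n%%EOF\n"

# Assumption: the "100K" default from the quoted message, taken as 100000 bytes.
MAX_DOC_SIZE = 100_000

tmp = io.BytesIO()  # plays the role of the temporary file
written = copy_truncated(io.BytesIO(fake_pdf), tmp, MAX_DOC_SIZE)

print("bytes copied:", written)
print("original has %%EOF:", has_pdf_trailer(fake_pdf))
print("truncated copy has %%EOF:", has_pdf_trailer(tmp.getvalue()))
```

The original data passes the check and the truncated copy does not, which matches the symptom: the files on the server are fine, and only the temporary copy is broken.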

It was a little confusing to start with, but it is dealt with in the FAQ.

Fare Thee Well
Anthony Peacock
CHIME, UCL Medical School
E-Mail: a.peacock@chime.ucl.ac.uk
WWW: http://www.chime.ucl.ac.uk/~rmhiajp/


