Re: htdig: PDF:parse & max_doc_size


J. op den Brouw (MSQL_User@st.hhs.nl)
Thu, 26 Nov 1998 11:33:34 +0100


Gordon Hopper wrote:
>
> I just discovered that max_doc_size is different from max_head_length.
> Furthermore the default for max_doc_size is 100K (defaults.cc). This is
> fine except when indexing large PDF files. The problem is that the
> error message is not correct. I got many errors like this (with only
> one -v):
>
> /tmp/htdig29740.pdf: Could not repair file.
> /tmp/htdig29740.pdf: Could not repair file.
> /tmp/htdig29740.pdf: Could not repair file.
> /tmp/htdig29740.pdf: Could not repair file.
> /tmp/htdig29740.pdf: Could not repair file.
> PDF::parse: cannot open acroread output
> PDF::parse: cannot open acroread output
> PDF::parse: cannot open acroread output
> PDF::parse: cannot open acroread output
> PDF::parse: cannot open acroread output
>
> Repeated many times...
>
> 'Could not repair file' made me think that there was a problem with some
> of my pdf files or with my acroread program. However, the error should
> have said something like this:
>
> Document.cc: /tmp/htdig29740.ps: file is too large
>
> Here is a sample of the debugging output (several -v's) that
> demonstrates what is displayed when a document is truncated:
>
> 8:8:1:http://www.et.byu.edu/caedm/software/misc/undergrad_cat.pdf:
> /tmp/htdig21961.pdf: Could not repair file.
> PDF::parse: cannot open acroread output
> size = 1998848
>
> I believe that PDF::parse still indexes the first part of the file when
> it is too long. I was unable to locate who is generating the 'Could not
> repair file' message. Since PDF::parse is not the problem here, perhaps
> this message should not be displayed with only one -v. Unless I missed
> something...
>
> Gordon

My guess is that acroread is trying to "repair" the file because htdig
has truncated it (only the first ... bytes are passed to acroread), and
it doesn't succeed in fixing it.
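
To illustrate why a truncated PDF would look damaged to a reader: a PDF
keeps its cross-reference pointer and end-of-file marker at the very end
of the file, so cutting the file at a size limit drops them. The sketch
below is not htdig or acroread code. The function names and the toy
"PDF" bytes are made up for illustration, and the 1000-byte limit is an
arbitrary stand-in for max_doc_size.

```python
def truncate(data: bytes, max_doc_size: int) -> bytes:
    """Keep only the first max_doc_size bytes, as a size cap would."""
    return data[:max_doc_size]


def has_pdf_trailer(data: bytes) -> bool:
    """Return True if the %%EOF marker appears in the last 1024 bytes."""
    return b"%%EOF" in data[-1024:]


# A toy stand-in for a PDF (hypothetical content): header, filler body,
# then the trailer section that a reader needs to locate the objects.
fake_pdf = b"%PDF-1.2\n" + b"x" * 5000 + b"\nstartxref\n5010\n%%EOF\n"

whole = truncate(fake_pdf, len(fake_pdf))  # limit large enough: untouched
cut = truncate(fake_pdf, 1000)             # limit too small: tail is lost

print(has_pdf_trailer(whole))
print(has_pdf_trailer(cut))
```

If that is what happens here, raising max_doc_size above the size of the
largest PDF should make the "Could not repair file" messages go away.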

--jesse