htdig: PDF:parse & max_doc_size


Gordon Hopper (kad5@email.byu.edu)
Wed, 25 Nov 1998 19:25:47 -0700


I just discovered that max_doc_size is a separate setting from
max_head_length, and that its default is 100K (see defaults.cc). That
is fine except when indexing large PDF files: they get truncated, and
the resulting error message is misleading. I got many errors like this
(with only one -v):

/tmp/htdig29740.pdf: Could not repair file.
/tmp/htdig29740.pdf: Could not repair file.
/tmp/htdig29740.pdf: Could not repair file.
/tmp/htdig29740.pdf: Could not repair file.
/tmp/htdig29740.pdf: Could not repair file.
PDF::parse: cannot open acroread output
PDF::parse: cannot open acroread output
PDF::parse: cannot open acroread output
PDF::parse: cannot open acroread output
PDF::parse: cannot open acroread output

Repeated many times...

'Could not repair file' made me think that there was a problem with some
of my PDF files or with my acroread program. However, the error should
have said something like this:

Document.cc: /tmp/htdig29740.ps: file is too large
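
As a rough sketch only (this is not taken from the ht://Dig source), a
check along these lines could produce such a message. The function name
truncation_warning and its parameters are hypothetical, and it assumes
the real size of the document is known at the point where the
max_doc_size limit is applied:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of ht://Dig: returns a diagnostic when a
// document is larger than the configured max_doc_size, or an empty
// string when the document fits.
std::string truncation_warning(const std::string &path,
                               std::size_t actual_size,
                               std::size_t max_doc_size)
{
    if (actual_size <= max_doc_size)
        return "";

    // Wording follows the message suggested above, with the sizes added.
    return "Document.cc: " + path + ": file is too large (" +
           std::to_string(actual_size) + " bytes, max_doc_size is " +
           std::to_string(max_doc_size) + ")";
}
```

With the values from this report, truncation_warning("/tmp/htdig21961.pdf",
1998848, 100000) would return a non-empty message naming the file and
both sizes.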

Here is a sample of the debugging output (several -v's) that
demonstrates what is displayed when a document is truncated:

8:8:1:http://www.et.byu.edu/caedm/software/misc/undergrad_cat.pdf:
/tmp/htdig21961.pdf: Could not repair file.
PDF::parse: cannot open acroread output
 size = 1998848

I believe that PDF::parse still indexes the first part of the file when
it is too long. I was unable to locate what is generating the 'Could not
repair file' message. Since PDF::parse is not the problem here, perhaps
this message should not be displayed with only one -v. Unless I missed
something...
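
In the meantime, the truncation itself can presumably be avoided by
raising max_doc_size in the htdig configuration file. The value below is
only an assumption, chosen to be larger than the 1998848-byte PDF shown
above; pick whatever suits the largest document you need indexed:

```
# htdig.conf fragment (illustrative value, not a recommendation)
# The default is 100K; this allows documents up to about 2.5 MB.
max_doc_size: 2500000
```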

Gordon


