Re: [htdig] What's the best parser?


Subject: Re: [htdig] What's the best parser?
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Tue Oct 17 2000 - 10:29:00 PDT


According to Martin Mielke:
> Hello all,
>
> I currently use conv_doc.pl as a general parser for PDF,
> PostScript and M$ Word documents.
> From time to time I get error messages like:
>
> --8<--8<--8<--
>
> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> Error: Couldn't find trailer dictionary
> Error: Couldn't read xref table
> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> Error: Couldn't find trailer dictionary
> Error: Couldn't read xref table
> Error (139803): Bad colorspace
>
> --8<--8<--8<--
>
> This happens even though 'max_doc_size' is set high enough for all PDFs
> to be parsed correctly and the files are safe and sound (users can
> open/read them without problems).
> Therefore I wonder if this is a parser-dependent issue rather than a
> configuration one. Maybe you have better experiences with other parsers
> giving best results... I'd like to hear some before
> downloading/installing/reconfiguring things here...

For PDFs, we still recommend conv_doc.pl or doc2html.pl as the front-end
external converter. Both of these use pdftotext from the xpdf package
as the actual conversion program.

To tell parser problems from configuration problems, try running
pdftotext directly on a few of the PDF files that are giving you
trouble. If you still get these errors, you should probably report them
to the xpdf author, Derek Noonburg. If you don't get the same errors,
it's likely a problem with your htdig configuration. You can also run
conv_doc.pl or doc2html.pl directly on your PDFs to see if that works,
which narrows the problem down to htdig, pdftotext, or the Perl scripts.
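
As a rough illustration of that first check, a small Python script along
these lines could run pdftotext over a batch of files and list the ones
that complain. This is only a sketch: the script and the names check_pdf
and report are made up for this example and are not part of ht://Dig or
xpdf. It assumes pdftotext is on your PATH, that giving "-" as the output
file sends the text to stdout, and that error messages go to stderr.
Whether your pdftotext version sets a non-zero exit status on failure is
also an assumption, so the script looks at the messages as well.

```python
import subprocess
import sys


def check_pdf(path, command=("pdftotext",)):
    """Run the converter on one file; return (exit_status, error_lines)."""
    # Assumed invocation: "pdftotext file.pdf -" writes the text to stdout.
    # The extracted text is discarded; only the error output is kept.
    result = subprocess.run(
        [*command, path, "-"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
        errors="replace",
    )
    messages = [line for line in result.stderr.splitlines() if line.strip()]
    return result.returncode, messages


def report(paths, command=("pdftotext",)):
    """Return {path: (exit_status, error_lines)} for files with problems."""
    problems = {}
    for path in paths:
        status, messages = check_pdf(path, command)
        # Some versions may print errors yet still exit with status 0,
        # so either signal counts as a problem here.
        if status != 0 or messages:
            problems[path] = (status, messages)
    return problems


if __name__ == "__main__":
    for path, (status, messages) in report(sys.argv[1:]).items():
        print(f"{path}: exit status {status}")
        for line in messages:
            print(f"    {line}")
```

Saved under a name of your choosing and run with the suspect PDFs as
arguments, it prints nothing for files that convert cleanly. If the same
"Couldn't read xref table" messages show up here, htdig is not involved.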

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930




This archive was generated by hypermail 2b28 : Tue Oct 17 2000 - 10:34:10 PDT