Re: [htdig] Still more info on pdf conversion problems.


Subject: Re: [htdig] Still more info on pdf conversion problems.
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Tue Feb 08 2000 - 09:25:00 PST


Hi, Stan. I see you're using ELM as your mailer. You should use the
group reply (g) command, rather than a simple reply (r) command, to make
your replies go to the list as well as the person who posted the message
to which you reply.

According to Stan Brown:
> On Mon Feb 7 16:25:11 2000 Gilles Detillieux wrote...
> >
> >According to Stan Brown:
> >> Sorry to keep posting bits of this, but it's an ongoing battle :-(
> >
> >Unfortunately, in all of these bits you never mentioned which version
> >of htdig you're running. This is extremely relevant, especially now that
> >the first 3.2 beta has been released.
>
> Well, I was running 3.1.3, which is why the convert keyword was not
> being recognized. I upgraded to 3.1.4 yesterday, and now indeed I get

The comments in conv_doc.pl do clearly state it's for htdig 3.1.4 or
later. The documentation on www.htdig.org always reflects the latest
stable release, so it may describe things not available in earlier
versions, without explicitly stating at which version a feature came
into effect.

> the same error when conv_doc.pl is called from htdig. However, I do
> not get the error when I run it from the command line. I do get it if I
> run parse_doc.pl from the command line. Does this make sense? Should I
> try 3.2 Beta?

This doesn't make much sense. parse_doc.pl and conv_doc.pl call the
various converter programs, like pdftotext, in the same way - it's only
the post-processing that's different. If pdftotext gives an error for
a given PDF file, it should do so consistently, regardless of the script
from which it was called, and regardless of whether that script was
called from htdig or the command line. Are you sure the error message
wasn't just getting lost in the output? Do you get the error message
when you redirect the standard output of parse_doc.pl or conv_doc.pl
(or pdftotext for that matter) to a file? Are you using the same PDF
test file in all cases?
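
One way to check whether an error message is getting lost is to capture
the standard output and the standard error separately, along with the exit
status. Here is a rough Python sketch of that idea. The function name
run_and_capture is just an illustration, not part of htdig or xpdf, and
the command you pass in is whatever you normally run by hand.

```python
import subprocess
import sys


def run_and_capture(argv):
    """Run a command and return (exit status, stdout text, stderr text).

    Keeping the two streams apart makes it obvious whether an error
    message was printed at all, or was merely buried in the normal output.
    """
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr
```

For example, run_and_capture(["pdftotext", "test.pdf", "test.txt"]), with
test.pdf standing in for one of the problem files, should show the same
exit status and error text every time if pdftotext is behaving
consistently.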

I can see no benefit to testing this under the 3.2 beta, as the problem
seems pretty clearly to be that pdftotext is having problems with some
of your PDFs. Changing the version of htdig, or the parser/converter
script, should have no effect on this.

> >That error definitely came from pdftotext. Running conv_doc.pl on the
> >same PDF will give you the same error message. If you're sure the PDF
> >file is correct, and pdftotext is in error, then you should report this
> >to Derek Noonburg, maintainer of the xpdf package. See the xpdf docs for
> >contact info.
>
> Acroread (3 or 4) and xpdf have no trouble reading these files.

This is very odd. I'd think that if pdftotext has problems reading a
file, xpdf would as well. You seem to be suggesting that even pdftotext
does not consistently give the same error message for the same PDF file.
Can you definitively confirm this? If so, there may be something wrong
with your xpdf package - perhaps it wasn't properly built for your system.

> BTW, since I made a stupid basic mistake with the version thing, let me
> get a sanity check on something else. I am expecting to see the --**+++
> stuff while the PDF files are being processed, as I do with HTML files,
> if running htdig with -v, right? If I'm not, it indicates that no
> extraction is being done, correct?

No, the --**+++ output is a result of parsed links to other documents.
The + indicates links added to the queue, the - indicates links that are
skipped or excluded, and * indicates links to documents that have already
been queued or parsed. As parse_doc.pl and conv_doc.pl don't attempt to
extract any hypertext links from the documents they parse, you won't see
any such output when these documents are parsed.
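
To make the meaning of those characters concrete, here is a small Python
sketch that tallies them in a string of progress output. It is only an
illustration of the legend above; the names MARKER_MEANINGS and
tally_link_markers are made up for this example and are not part of htdig.

```python
from collections import Counter

# Legend for the link markers, as described above.
MARKER_MEANINGS = {
    "+": "link added to the queue",
    "-": "link skipped or excluded",
    "*": "link already queued or parsed",
}


def tally_link_markers(progress):
    """Count each link marker in a string of progress output.

    Characters other than +, - and * are ignored.
    """
    counts = Counter(ch for ch in progress if ch in MARKER_MEANINGS)
    return {MARKER_MEANINGS[ch]: counts[ch] for ch in MARKER_MEANINGS}
```

A document with no extracted links, such as one handled by parse_doc.pl or
conv_doc.pl, would give a count of zero for all three markers, which says
nothing about whether its text was extracted.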

> If I run pdftotext by hand. I DO NOT get the errors. Instead I get a
> useful looking output file.

This is really strange. If pdftotext can correctly parse all of your PDF
files, especially the one(s) that gave you problems, then I can see no
reason for parse_doc.pl or conv_doc.pl to fail. Do you have more than
one pdftotext program installed on your system? Are the perl scripts
configured to run the same one that you run from the command line?
(Try the command "which pdftotext".) Is there a possibility that all
the virtual memory on your system is being used up at times, causing
things to fail intermittently?
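
Since "which" only reports the first match, a short Python sketch like the
following can list every executable of a given name on the search path.
The function name find_all_on_path is just for illustration. Keep in mind
that the PATH seen by a script run from htdig may differ from the one in
your login shell, so it is worth running the check in both environments.

```python
import os


def find_all_on_path(name, path=None):
    """Return every executable file called `name` found on the search path.

    `path` defaults to the PATH environment variable; directories are
    checked in order, so the first entry is the one a shell would run.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    matches = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            if candidate not in matches:
                matches.append(candidate)
    return matches
```

If find_all_on_path("pdftotext") returns more than one entry, compare that
list against the path configured in the perl scripts.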

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


This archive was generated by hypermail 2b28 : Tue Feb 08 2000 - 09:27:24 PST