Re: [htdig] Problem with PDF files....


Subject: Re: [htdig] Problem with PDF files....
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Mon Jan 15 2001 - 09:22:12 PST


According to Elijah Kagan:
> I run htdig 3.1.5.
> I tried both the Debian package and a compiled one with the same result.
> I am absolutely sure there is something stupid I forgot to put into the
> configuration.
>
> Attached is the config file.
>
> Thanks for your help.
>
> Elijah
>
>
> On Fri, 12 Jan 2001, Gilles Detillieux wrote:
>
> > According to Elijah Kagan:
> > > 1. I run htdig with an explicit -c option, so it uses the correct conf
> > > file.
> > > 2. I rewrote the external_parsers so it includes only one line...
> > > 3. ..and it is the first line in the file
> > >
> > > Results are the same! It is still looking for an acroread!
> > >
> > > Please, help. I am getting desperate...
> >
> > Hmm. You're sure you're running version 3.1.5 of htdig, and you
> > don't have a pre-3.1.4 binary of htdig kicking around that you might be
> > unknowingly running instead? External converter support was added to the
> > external_parsers attribute only in version 3.1.4 and above. If you're
> > sure this isn't the problem either, please send me a copy of your conf
> > file as it stands now (preferably uuencoded right on your htdig box to
> > prevent e-mail mangling of it), and I'll have a look and try a test or two.
> >
> > Oh, another thing. You mentioned this was on a Debian system. Did you
> > compile htdig yourself, or did you use a pre-compiled binary? If the
> > latter, which one?

OK, it took a while, but the light finally came on! If you look up the
following thread on the mailing list archives:

    http://www.htdig.org/mail/2000/09/index.html#75

you'll see that the bug has come up before. I think there's something
about the Debian configuration for Apache that causes it to add the
"; charset=..." string to the Content-Type header, which is the source
of the problem here. At least I strongly suspect it must be the same
problem, as I can't see anything else that would explain the behaviour
you're reporting. If you run htdig -vvv -i -c ..., you can look at the
header lines your server returns for the PDF files, and check whether
the Content-Type header does indeed have anything following the
application/pdf string on that line.
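
To illustrate why a trailing "; charset=..." would matter, here is a
rough Python sketch of the suspected failure. It is not htdig's actual
code (that lives in ExternalParser.cc, in C++); the PARSERS table and
the function names are hypothetical and only mimic a one-line
external_parsers setting. It assumes the content type is looked up as
an exact string, and shows how dropping the parameters first would make
the lookup succeed.

```python
# Hypothetical parser table keyed by bare media type, standing in for a
# one-line external_parsers definition. Not htdig's real data structure.
PARSERS = {"application/pdf": "/usr/share/htdig/conv_doc.pl"}


def media_type(content_type):
    """Return the bare media type, lowercased, without any parameters."""
    return content_type.split(";", 1)[0].strip().lower()


def find_parser_exact(content_type):
    """Look up the header value as-is (the assumed buggy behaviour)."""
    return PARSERS.get(content_type)


def find_parser_normalized(content_type):
    """Drop parameters such as charset before looking up the parser."""
    return PARSERS.get(media_type(content_type))


if __name__ == "__main__":
    header = "application/pdf; charset=iso-8859-1"
    print(find_parser_exact(header))       # no match for the full string
    print(find_parser_normalized(header))  # matches on application/pdf
```

With an exact lookup, the header value as a whole has no entry, so the
indexer would fall through to whatever it does by default for PDFs,
which would explain it still looking for acroread.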

Geoff and I made some hacks to ExternalParser.cc in the 3.2.0b3
development code to address this, but none of this has been backported
to 3.1.5 yet. I'll see if I can backport some or all of the external
parser patches to 3.1.5 in the next day or two. In the meantime,
you can try working around this either by using local_urls, if you're
running htdig on the same machine as your Apache server, or by using
the same hack that Klaus used, i.e. add a line like the following to
your external_parsers definition.

                        "application/pdf; charset=iso-8859-1->text/html" /usr/share/htdig/conv_doc.pl

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



