Re: [htdig] Indexing PDF Files


Subject: Re: [htdig] Indexing PDF Files
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Nov 01 2000 - 14:40:38 PST


If that still doesn't solve the problem, try running conv_doc.pl (or even
pdftotext) directly on some of your problem PDF files. I suspect that
these files contain no indexable text, but only images, which is a common
problem with some PDFs.
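
A rough sketch of that check, for anyone who wants to test several files at once. It assumes pdftotext (from xpdf) is on the PATH and that giving "-" as the output file sends the text to stdout. The helper names and the 3-character word threshold are illustrative choices, not part of htdig or conv_doc.pl.

```python
import re
import subprocess
import sys


def extract_text(pdf_path):
    """Run pdftotext on one file and return whatever text it emits."""
    # Assumption: "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("latin-1", errors="replace")


def count_words(text, min_length=3):
    """Count alphanumeric runs of at least min_length characters."""
    # The threshold is an arbitrary illustration of "indexable word".
    return len(re.findall(r"[A-Za-z0-9]{%d,}" % min_length, text))


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_pdf.py a.pdf b.pdf
    for path in sys.argv[1:]:
        words = count_words(extract_text(path))
        verdict = "has text" if words else "no text found (image-only?)"
        print(f"{path}: {words} words, {verdict}")
```

A file that reports zero words here is unlikely to yield an excerpt, whichever converter script is configured.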

You also didn't mention how you installed htdig on your Red Hat 6.2 system.
If you installed from RPM, make sure you used the correct one, i.e. the
"glibc21" version.

According to creep@datacreep.net:
> Use conv_doc.pl instead of parse_doc
>
> get it from http://www.htdig.org/files/contrib/parsers/conv_doc.pl.gz
> gunzip it and move it to /usr/local/bin
>
> get xpdf from ftp://ftp.foolabs.com/pub/xpdf/xpdf-0.91.tgz
>
> get ps2ascii from your freetype or ghostscript installation
>
> put this in your conf/htdig.conf
> external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
> application/postscript->text/html /usr/local/bin/conv_doc.pl \
> application/pdf->text/html /usr/local/bin/conv_doc.pl
>
>
> On Wed, 1 Nov 2000, Roy Stephane wrote:
>
> > I have problems indexing PDF files. I have already consulted FAQ 4.9
> > and 5.2, so all my paths are OK and the MAX_DOC_SIZE parameter is greater
> > than my biggest PDF file. I am working with the external parser
> > "parse_doc.pl".
> >
> > When I run rundig in verbose mode, I find that htdig recognises all my
> > PDF files and shows their sizes. After that, when htmerge finds a PDF, it says
> > that there is no excerpt, so the file (temporary file) is deleted.
> >
> > I tried to find the parameters that are used to call htdig from rundig.
> > Since an output command on each variable shows nothing on screen, I assume
> > that all the parameters used have null values.
> >
> > I am using Red Hat 6.2 and Apache 1.3.
> >
> > Thanks for your help!
> >
> > Stéphane Roy
> > sroy@oerlikon.ca <mailto:sroy@oerlikon.ca>
> > (450) 542-5906

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
