Subject: Re: [htdig] No Excerpt Error
From: Gilles Detillieux (email@example.com)
Date: Fri Jul 14 2000 - 08:26:35 PDT
According to Paul Watters:
> I'm trying to index a set of PDF files using htdig. I've successfully
> indexed other PDF files using the same installation, but we now have a new
> person doing our PDF's, and they don't seem to be working. We are using
> acroread for parsing.
> If I execute:
> rundig -vvv
> I see a message like the following for each of the PDF files:
> Read a total of 20260 bytes
> PDF::setContents(20260 bytes)
> But, later on, I see the following:
> Deleted, no excerpt: 109/http://tango.uac.edu.au/htdig/course/mq/i/300114.pdf
> None of my files are actually being indexed. Does anyone have any
My first inclination would be to conclude that the PDFs don't contain any
text. It's possible to build PDFs that contain only images, particularly
when you build PDFs from scanned documents. Perhaps that's what the new
person is doing? Sometimes you can tell just by viewing them in acroread.
Image-only PDFs will often have grainy-looking text. You can also
run acroread -toPostScript on the PDFs and look at the PostScript output.
If there are no text blocks in the output, that's a sure sign as well,
though there may be a lot of PS code to sift through before you can
You may also want to pick up a copy of the xpdf package and doc2html or
conv_doc.pl, and use an external converter rather than acroread, just
to see if you get different results that way. If you don't, then run
pdftotext on some of your PDFs to see if it can get any text out of them.
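
To check a whole batch of PDFs this way, something along these lines might help. It is only a rough sketch: it assumes pdftotext from the xpdf package is on your PATH, and the helper names (count_words, looks_image_only, extract_text) and the 5-word threshold are made up for illustration, not anything htdig itself uses.

```python
import subprocess
import sys


def count_words(text: str) -> int:
    """Count whitespace-separated tokens containing at least one letter or digit."""
    return sum(1 for token in text.split() if any(ch.isalnum() for ch in token))


def looks_image_only(text: str, min_words: int = 5) -> bool:
    """Heuristic: almost no extracted words suggests an image-only PDF.

    min_words is an arbitrary threshold chosen for this sketch.
    """
    return count_words(text) < min_words


def extract_text(pdf_path: str) -> str:
    """Run `pdftotext <file> -`, which writes the extracted text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=False,  # a failed conversion simply yields little or no text
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_pdfs.py file1.pdf file2.pdf ...
    for path in sys.argv[1:]:
        text = extract_text(path)
        verdict = "no text found" if looks_image_only(text) else "has text"
        print(f"{path}: {verdict}")
```

Any file reported as "no text found" is a likely candidate for the "Deleted, no excerpt" message, since there would be nothing for htdig to build an excerpt from.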
-- 
Gilles R. Detillieux              E-mail: <firstname.lastname@example.org>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930