[htdig] PDF indexing problem: Deleted, no excerpt


Subject: [htdig] PDF indexing problem: Deleted, no excerpt
From: Mike Gardner (Mike.Gardner@nottingham.ac.uk)
Date: Wed Aug 09 2000 - 05:47:04 PDT


I have htdig 3.1.5 happily installed on my SuSE 6.4 box, but it doesn't seem to be indexing PDF files.
Here are some details:

HTDIG.CONF
pdf_parser /usr/local/bin/acroread -toPostScript -pairs
max_doc_size 300000

(acroread is version 3.1 and will happily convert a sample PDF to PS; all PDFs are well under the max_doc_size)
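
The size claim can be double-checked with a short script. This is only a minimal sketch: the function name oversized_pdfs and the idea of scanning a local copy of the document tree are assumptions for illustration, and the limit is simply the max_doc_size value quoted above.

```python
import os

MAX_DOC_SIZE = 300000  # max_doc_size value from the config excerpt above


def oversized_pdfs(root, limit=MAX_DOC_SIZE):
    """Return sorted (path, size) pairs for .pdf files under root above limit bytes."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            # Match the extension case-insensitively (.pdf, .PDF, ...).
            if name.lower().endswith(".pdf"):
                path = os.path.join(dirpath, name)
                size = os.path.getsize(path)
                if size > limit:
                    hits.append((path, size))
    return sorted(hits)


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pass the local document root as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, size in oversized_pdfs(root):
        print(f"{size:>10}  {path}")
```

An empty result means no PDF under the given directory exceeds the limit.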

HTDIG -v
lists the PDF files and their sizes OK (i.e. it looks as though they are being indexed);
however, I don't see the '+--+--**' that you get for HTML files. Is this a problem?

HTMERGE -v
"Deleted, no excerpt: x/http;//.......PDF"
I get this message once for each of my PDFs

I read in an earlier post that "Deleted, no excerpt" can be due to:
> > - disallowed in robots.txt
> > - indexing turned off by meta robots or noindex tags
> > - no indexable text in documents
> > - server_max_docs exceeded
> Also when merging:
> - duplicates between the two databases (oldest is removed)

These files aren't disallowed or turned off.
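
The robots.txt part of that check can be done mechanically rather than by eye. The sketch below uses Python's standard urllib.robotparser; the helper name allowed_by_robots, the example URLs, and the agent name "htdig" are assumptions for illustration (the agent should match whatever name the indexer actually announces).

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_lines, url, agent="htdig"):
    """Return True if robots.txt rules (given as a list of lines) permit agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules already in memory, no network access
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Hypothetical rules and URLs, purely for illustration.
    rules = ["User-agent: *", "Disallow: /private/"]
    for url in ("http://example.com/docs/a.pdf",
                "http://example.com/private/a.pdf"):
        print(url, allowed_by_robots(rules, url))
```

This only covers robots.txt; meta robots and noindex tags would still need to be checked separately.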
server_max_docs isn't set in my htdig.conf; I don't think this would be a problem anyway, as it's a small site (around 100 pages).
So I assume that there's no indexable text because the PDF parsing failed (even though there were no error messages).
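
Since a silent failure is the suspicion here, one way to test it is to run the converter by hand and check more than the exit status. The following is a rough sketch, not how htdig itself invokes the parser: the helper convert_and_check is hypothetical, and it assumes the converter accepts an input path followed by an output path as its last two arguments (which is how the -pairs form above is used).

```python
import os
import subprocess


def convert_and_check(command, pdf_path, ps_path, timeout=60):
    """Run `command pdf_path ps_path`; return (ok, detail).

    A zero exit status alone is not trusted: the output file must
    also exist and be non-empty.
    """
    try:
        result = subprocess.run(
            list(command) + [pdf_path, ps_path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, f"could not run converter: {exc}"
    if result.returncode != 0:
        message = result.stderr.decode(errors="replace").strip()
        return False, f"exit status {result.returncode}: {message}"
    if not os.path.isfile(ps_path) or os.path.getsize(ps_path) == 0:
        return False, "converter exited cleanly but wrote no output"
    return True, f"{os.path.getsize(ps_path)} bytes written"


if __name__ == "__main__":
    # Command taken from the pdf_parser line above; the file paths are placeholders.
    cmd = ["/usr/local/bin/acroread", "-toPostScript", "-pairs"]
    print(convert_and_check(cmd, "sample.pdf", "/tmp/sample.ps"))
```

If this reports success for a PDF that htmerge later deletes, the conversion step is probably not where the text is being lost.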

Any hints anyone?
Or should I just install xpdf and try that?

thanks in advance,
mike



