[htdig] Parsing PDF files.


From: Wayne Fool (wfool@ProgressLighting.com)
Date: Thu Jun 15 2000 - 10:31:49 PDT


I have been working on getting PDF files to index. So far the going is
slow. I have 400 PDF files that are in the 20-40k size range.

My hardware is as follows: Pentium 75, 32 MB RAM, 1.6 GB hard disk (500 MB
free). It is connected as an intranet web server accessible only by people
in our office.

I have htdig version 3.1.5, with max_doc_size set to 5000000.

I have tried parse_doc.pl, conv_doc.pl, and doc2html.pl. When I run any of
them manually, I get 14 consecutive ":=command not found" error messages,
then a "syntax error near unexpected token '( )' " error message, and
finally a message stating "line 83: 'parts = ( );". I have checked the
locations of ps2ascii and pdftotext set in the scripts, and they are
correct. When run through rundig -vvv, the script just shuts down.
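
The messages quoted above ("command not found", "syntax error near
unexpected token") look like the kind a shell prints rather than Perl, so
one thing worth ruling out is that the scripts are being read by sh
instead of perl. The following is only a sketch of that check, not a
diagnosis: script_interpreter is a hypothetical helper name, and the path
/usr/local/bin/parse_doc.pl is taken from the commented-out
external_parsers line in the config excerpt further down.

```shell
#!/bin/sh
# Hypothetical helper: print the interpreter named on a script's first
# line, or "none" if that line does not start with "#!".
script_interpreter() {
    # tr drops a stray carriage return (DOS line endings), which can
    # also make the interpreter line unusable.
    head -n 1 "$1" | tr -d '\r' |
        awk '{ if (sub(/^#![ \t]*/, "")) print $1; else print "none" }'
}

# Path assumed from the external_parsers line in htdig.conf.
script=/usr/local/bin/parse_doc.pl
if [ -r "$script" ]; then
    interp=$(script_interpreter "$script")
    echo "$script: interpreter line names: $interp"
    # Report whether the named interpreter actually exists on this box.
    if [ "$interp" != "none" ] && [ ! -x "$interp" ]; then
        echo "  $interp is missing or not executable"
    fi
fi
```

If this reports "none", or names a perl path that does not exist here,
running the script as an argument to perl itself would be a way to
confirm that the interpreter line is the problem.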

I have also tried acroread. It parses the PDFs and says that it reads
them, but htmerge discards them. I know there is text in the title, which
is what I need indexed: I can see it in the PostScript file after
acroread finishes (when run manually).

Following is an excerpt from the output of rundig -vvv using acroread:
pick: labweb1, # servers = 1
37:37:3:http://labweb1/pdf/2000001.pdf: Trying local files
  found existing file /home/httpd/html/pdf/2000001.pdf
Read 8192 from document
Read 8192 from document
Read 2218 from document
Read a total of 18602 bytes
PDF::setContents(18602 bytes)
PDF::parse(http://labweb1/pdf/2000001.pdf)

title: P3480 Eclipse Flush Mount
PDF::parse: 5095 lines parsed
PDF::parse ends normally
 size = 18602

It looks like it is reading the title. Is there a way to index those words
along with the 5095 lines of text? No file is returned when I search on
any of the words in the file.

This is the applicable part of the htdig.conf file:

# These attributes allow indexing server via local filesystem rather than HTTP.
local_urls: http://labweb1/=/home/httpd/html/
local_user_urls: http://labweb1/=/home/,/public_html/
pdf_parser: /bin/acroread -toPostScript -pairs

#external_parsers: application/msword /usr/local/bin/parse_doc.pl \
                  application/postscript /usr/local/bin/parse_doc.pl \
                  application/pdf /usr/local/bin/parse_doc.pl
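
Since the parser that actually runs is whichever program the active
(uncommented) attribute names, a quick sanity check is to pull that
program out of the config and confirm it is executable. This is a minimal
sketch under assumptions: config_program is a hypothetical helper,
HTDIG_CONF is a hypothetical variable for wherever htdig.conf lives, and
the parsing only handles simple one-line "name: value" attributes like
the pdf_parser line above.

```shell
#!/bin/sh
# Hypothetical helper: print the program (first word of the value) that
# an uncommented "name: value" line assigns to the given attribute.
# $1 = config file, $2 = attribute name
config_program() {
    awk -v name="$2" '
        /^[ \t]*#/ { next }              # skip commented-out lines
        {
            line = $0
            sub(/^[ \t]+/, "", line)     # ignore leading blanks
            if (index(line, name ":") == 1) {
                # Re-split the value so $1 is the program path.
                $0 = substr(line, length(name) + 2)
                print $1
                exit
            }
        }' "$1"
}

# HTDIG_CONF is an assumed variable; point it at the real config file.
conf=${HTDIG_CONF:-htdig.conf}
if [ -r "$conf" ]; then
    prog=$(config_program "$conf" pdf_parser)
    if [ -n "$prog" ] && [ -x "$prog" ]; then
        echo "pdf_parser program $prog is executable"
    else
        echo "pdf_parser program '$prog' is missing or not executable"
    fi
fi
```

With the excerpt above, the helper would report /bin/acroread for
pdf_parser and nothing for external_parsers, because that line is
commented out.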

I would appreciate it if you could point me in the right direction. This is
driving me nuts. If I need to provide any further information, I would be
glad to. TIA, I appreciate it.

Wayne



