[htdig] PDF problems


Subject: [htdig] PDF problems
From: The Melia Family (melias@hypermax.net.au)
Date: Sat Dec 30 2000 - 23:41:01 PST


Hello,

I am using ht://Dig 3.1.5 on Red Hat 7.0 and am having problems indexing PDF
files. My config and the -vv output are included below. I have no robots.txt
file, my max_doc_size is now 10M (one test .pdf file is only 27K and it
also fails), and .pdf is not listed in bad_extensions.
I am using the latest xpdf with pdftotext, as well as the latest parse_doc.pl
and conv_doc.pl scripts.

I can run pdftotext on the PDF files by hand, and they do contain real text,
not just images. I can also run parse_doc.pl and conv_doc.pl by hand, and they
produce proper text. When I do a rundig, I get a 'url rejected' message and I
do not know why. This (I presume) leads to the 'Deleted, no excerpt' message,
and the file (or any PDF file) is not indexed. Any suggestions?

Regards,
Tony

___________BELOW is my CONFIG ________

external_parsers: application/msword /usr/bin/parse_doc.pl \
                  application/postscript /usr/bin/parse_doc.pl \
                  application/pdf /usr/bin/parse_doc.pl

database_dir: /data/software/htdigdb

local_urls: http://80.1.1.4/=/var/www/html/

start_url: http://80.1.1.4/htdig/

limit_urls_to: ${start_url}

exclude_urls: /cgi-bin/ .cgi

bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif .iso \
                .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi

maintainer: root@80.1.1.4

max_head_length: 50000

max_doc_size: 10000000

no_excerpt_show_top: true

search_algorithm: exact:1 synonyms:0.5 endings:0.1

no_next_page_text:
no_prev_page_text:
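
The two settings mentioned in the message (that .pdf is not in bad_extensions, and that application/pdf is mapped to a parser) can be cross-checked mechanically. The following is a minimal Python sketch, not ht://Dig's own parser: it assumes the usual "name: value" attribute layout with a trailing backslash continuing a line, and it assumes external_parsers is a flat list of content-type/program pairs. The function name parse_htdig_conf and the SAMPLE string are made up for illustration; SAMPLE repeats part of the config above, with the bad_extensions entry shown with its wrapped lines rejoined.

```python
def parse_htdig_conf(text):
    """Parse 'name: value' attributes, joining lines that end in a backslash.

    Simplified reading of the config format (assumption): no ${var}
    expansion and no include handling.
    """
    attrs = {}
    logical = ""
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.endswith("\\"):
            # Continuation: drop the backslash and keep accumulating.
            logical += line[:-1] + " "
            continue
        logical += line
        stripped = logical.strip()
        logical = ""
        if not stripped or stripped.startswith("#"):
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        name, sep, value = stripped.partition(":")
        if sep:
            attrs[name.strip()] = value.strip()
    return attrs


SAMPLE = r"""
external_parsers: application/msword /usr/bin/parse_doc.pl \
                  application/postscript /usr/bin/parse_doc.pl \
                  application/pdf /usr/bin/parse_doc.pl
bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif .iso \
                .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi
max_doc_size: 10000000
"""

conf = parse_htdig_conf(SAMPLE)
bad = conf["bad_extensions"].split()
tokens = conf["external_parsers"].split()
# Pair up content types with their programs: type, program, type, program, ...
parsers = dict(zip(tokens[0::2], tokens[1::2]))

print(".pdf in bad_extensions:", ".pdf" in bad)
print("parser for application/pdf:", parsers.get("application/pdf"))
print("max_doc_size:", int(conf["max_doc_size"]))
```

To check the real file instead of the sample, read the config from disk and pass its text to parse_htdig_conf; the path depends on the installation.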

____________Below is output of rundig -vv using 2 pdf files and 1 txt file ______

New server: 80.1.1.4, 80
Trying local files
  tried local file /var/www/html/robots.txt
Local retrieval failed, trying HTTP
pick: 80.1.1.4, # servers = 1
0:0:0:http://80.1.1.4/htdig/mx59pro/manual/english/: Trying local files
  tried local file /var/www/html/htdig/mx59pro/manual/english/index.html
Local retrieval failed, trying HTTP

title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=D">

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/?N=D
+A tag: pos = 2, position = ="?M=A">

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/?M=A
+A tag: pos = 2, position = ="?S=A">

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/?S=A
+A tag: pos = 2, position = ="?D=A">

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/?D=A
+A tag: pos = 2, position = ="/htdig/mx59pro/manual/">

url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="content.pdf">

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/content.pdf
+A tag: pos = 2, position = ="content.txt">

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/content.txt
+A tag: pos = 2, position = ="sonic.pdf">

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/sonic.pdf
+ size = 954
pick: 80.1.1.4, # servers = 1
1:1:1:http://80.1.1.4/htdig/mx59pro/manual/english/?N=D: Trying local files
  tried local file /var/www/html/htdig/mx59pro/manual/english/?N=D
Local retrieval failed, trying HTTP

title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=A">

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/?N=A
+A tag: pos = 2, position = ="?M=A">
*A tag: pos = 2, position = ="?S=A">
*A tag: pos = 2, position = ="?D=A">
*A tag: pos = 2, position = ="/htdig/mx59pro/manual/">

url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="sonic.pdf">
*A tag: pos = 2, position = ="content.txt">
*A tag: pos = 2, position = ="content.pdf">
* size = 954
pick: 80.1.1.4, # servers = 1
2:2:1:http://80.1.1.4/htdig/mx59pro/manual/english/?M=A: Trying local files
  tried local file /var/www/html/htdig/mx59pro/manual/english/?M=A
Local retrieval failed, trying HTTP

title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=A">
*A tag: pos = 2, position = ="?M=D">

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/?M=D
+A tag: pos = 2, position = ="?S=A">
*A tag: pos = 2, position = ="?D=A">
*A tag: pos = 2, position = ="/htdig/mx59pro/manual/">

url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="content.pdf">
*A tag: pos = 2, position = ="sonic.pdf">
*A tag: pos = 2, position = ="content.txt">
* size = 954
pick: 80.1.1.4, # servers = 1
3:3:1:http://80.1.1.4/htdig/mx59pro/manual/english/?S=A: Trying local files
  tried local file /var/www/html/htdig/mx59pro/manual/english/?S=A
Local retrieval failed, trying HTTP

title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=A">
*A tag: pos = 2, position = ="?M=A">
*A tag: pos = 2, position = ="?S=D">

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/?S=D
+A tag: pos = 2, position = ="?D=A">
*A tag: pos = 2, position = ="/htdig/mx59pro/manual/">

url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="content.txt">
*A tag: pos = 2, position = ="content.pdf">
*A tag: pos = 2, position = ="sonic.pdf">
* size = 954
pick: 80.1.1.4, # servers = 1
4:4:1:http://80.1.1.4/htdig/mx59pro/manual/english/?D=A: Trying local files
  tried local file /var/www/html/htdig/mx59pro/manual/english/?D=A
Local retrieval failed, trying HTTP

title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=A">
*A tag: pos = 2, position = ="?M=A">
*A tag: pos = 2, position = ="?S=A">
*A tag: pos = 2, position = ="?D=D">

   pushing http://80.1.1.4/htdig/mx59pro/manual/english/?D=D
+A tag: pos = 2, position = ="/htdig/mx59pro/manual/">

url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="content.pdf">
*A tag: pos = 2, position = ="content.txt">
*A tag: pos = 2, position = ="sonic.pdf">
* size = 954
pick: 80.1.1.4, # servers = 1
5:5:1:http://80.1.1.4/htdig/mx59pro/manual/english/content.pdf: Trying local files
  found existing file /var/www/html/htdig/mx59pro/manual/english/content.pdf
 size = 6705
pick: 80.1.1.4, # servers = 1
6:6:1:http://80.1.1.4/htdig/mx59pro/manual/english/content.txt: Trying local files
  found existing file /var/www/html/htdig/mx59pro/manual/english/content.txt
 size = 115
pick: 80.1.1.4, # servers = 1
7:7:1:http://80.1.1.4/htdig/mx59pro/manual/english/sonic.pdf: Trying local files
  found existing file /var/www/html/htdig/mx59pro/manual/english/sonic.pdf
 size = 377264
pick: 80.1.1.4, # servers = 1
8:8:2:http://80.1.1.4/htdig/mx59pro/manual/english/?N=A: Trying local files
  tried local file /var/www/html/htdig/mx59pro/manual/english/?N=A
Local retrieval failed, trying HTTP

title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=D">
*A tag: pos = 2, position = ="?M=A">
*A tag: pos = 2, position = ="?S=A">
*A tag: pos = 2, position = ="?D=A">
*A tag: pos = 2, position = ="/htdig/mx59pro/manual/">

url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="content.pdf">
*A tag: pos = 2, position = ="content.txt">
*A tag: pos = 2, position = ="sonic.pdf">
* size = 954
pick: 80.1.1.4, # servers = 1
9:9:2:http://80.1.1.4/htdig/mx59pro/manual/english/?M=D: Trying local files
  tried local file /var/www/html/htdig/mx59pro/manual/english/?M=D
Local retrieval failed, trying HTTP

title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=A">
*A tag: pos = 2, position = ="?M=A">
*A tag: pos = 2, position = ="?S=A">
*A tag: pos = 2, position = ="?D=A">
*A tag: pos = 2, position = ="/htdig/mx59pro/manual/">

url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="content.txt">
*A tag: pos = 2, position = ="sonic.pdf">
*A tag: pos = 2, position = ="content.pdf">
* size = 954
pick: 80.1.1.4, # servers = 1
10:10:2:http://80.1.1.4/htdig/mx59pro/manual/english/?S=D: Trying local files
  tried local file /var/www/html/htdig/mx59pro/manual/english/?S=D
Local retrieval failed, trying HTTP

title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=A">
*A tag: pos = 2, position = ="?M=A">
*A tag: pos = 2, position = ="?S=A">
*A tag: pos = 2, position = ="?D=A">
*A tag: pos = 2, position = ="/htdig/mx59pro/manual/">

url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="sonic.pdf">
*A tag: pos = 2, position = ="content.pdf">
*A tag: pos = 2, position = ="content.txt">
* size = 954
pick: 80.1.1.4, # servers = 1
11:11:2:http://80.1.1.4/htdig/mx59pro/manual/english/?D=D: Trying local files
  tried local file /var/www/html/htdig/mx59pro/manual/english/?D=D
Local retrieval failed, trying HTTP

title: Index of /htdig/mx59pro/manual/english
A tag: pos = 2, position = ="?N=A">
*A tag: pos = 2, position = ="?M=A">
*A tag: pos = 2, position = ="?S=A">
*A tag: pos = 2, position = ="?D=A">
*A tag: pos = 2, position = ="/htdig/mx59pro/manual/">

url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
A tag: pos = 2, position = ="sonic.pdf">
*A tag: pos = 2, position = ="content.txt">
*A tag: pos = 2, position = ="content.pdf">
* size = 954
pick: 80.1.1.4, # servers = 1
htmerge: Sorting...
htmerge: Merging...

0/http://80.1.1.4/htdig/mx59pro/manual/english/
4/http://80.1.1.4/htdig/mx59pro/manual/english/?D=A
11/http://80.1.1.4/htdig/mx59pro/manual/english/?D=D
2/http://80.1.1.4/htdig/mx59pro/manual/english/?M=A
9/http://80.1.1.4/htdig/mx59pro/manual/english/?M=D
8/http://80.1.1.4/htdig/mx59pro/manual/english/?N=A
1/http://80.1.1.4/htdig/mx59pro/manual/english/?N=D
3/http://80.1.1.4/htdig/mx59pro/manual/english/?S=A
10/http://80.1.1.4/htdig/mx59pro/manual/english/?S=D
Deleted, no excerpt: 5/http://80.1.1.4/htdig/mx59pro/manual/english/content.pdf
6/http://80.1.1.4/htdig/mx59pro/manual/english/content.txt
htmerge: 10
Deleted, no excerpt: 7/http://80.1.1.4/htdig/mx59pro/manual/english/sonic.pdf
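
Because the output is long, it can help to pull out just the 'url rejected' and 'Deleted, no excerpt' lines and see whether they name the same URLs. The following is a minimal Python sketch that does only that. The function name summarize and the SAMPLE string are made up for illustration; SAMPLE copies a few lines from the output above. The sketch assumes that a bare "Deleted, no excerpt:" line was wrapped by the mail client and that the id/URL on the following line belongs to it. It only summarizes what the log says and does not explain why a document ends up with no excerpt.

```python
import re

DELETED_PREFIX = "Deleted, no excerpt:"


def summarize(log_text):
    """Collect rejected URLs and 'Deleted, no excerpt' URLs from -vv output."""
    rejected = set()
    deleted = []
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        m = re.match(r"url rejected: \(level \d+\)(\S+)", line)
        if m:
            rejected.add(m.group(1))
            continue
        if line.startswith(DELETED_PREFIX):
            rest = line[len(DELETED_PREFIX):].strip()
            # Assumption: if nothing follows the colon, the id/URL was
            # wrapped onto the next line.
            if not rest and i + 1 < len(lines):
                rest = lines[i + 1].strip()
            m = re.match(r"\d+/(\S+)", rest)
            if m:
                deleted.append(m.group(1))
    return rejected, deleted


SAMPLE = """\
url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
url rejected: (level 1)http://80.1.1.4/htdig/mx59pro/manual/
Deleted, no excerpt:
5/http://80.1.1.4/htdig/mx59pro/manual/english/content.pdf
6/http://80.1.1.4/htdig/mx59pro/manual/english/content.txt
htmerge: 10
Deleted, no excerpt:
7/http://80.1.1.4/htdig/mx59pro/manual/english/sonic.pdf
"""

rejected, deleted = summarize(SAMPLE)
overlap = [u for u in deleted if u in rejected]

print("rejected:", sorted(rejected))
print("deleted, no excerpt:", deleted)
print("deleted URLs that were also rejected:", overlap)
```

On the sample lines, the only rejected URL is the parent directory /htdig/mx59pro/manual/, while the two deleted entries are the PDF files, so the two messages refer to different URLs in this run.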




This archive was generated by hypermail 2b28 : Sat Dec 30 2000 - 23:56:51 PST