Subject: [htdig3-dev] Wordlist
From: Geschke Steffen (Steffen.Geschke@erlf.siemens.de)
Date: Sat Nov 25 2000 - 14:20:55 PST
I have just upgraded from 3.2.0b2 to 3.2.0b3 (Snapshot 11-19)
and have run into the following (new) problem:
The wordlist database is only created for documents of
mime type "text/html". Documents of other mime types are indexed too,
but the words found in them are not included
in the wordlist.
I didn't change the configuration file from 3.2.0b2 and AFAIK
I did not exclude any mime types explicitly.
Here is a short excerpt of what htdig prints in verbose mode when
I index a pdf file:
Making HTTP request on http://intra1.erlf.siemens.de/test.pdf ...
Header line: Content-Type: application/pdf
Retrieving document /test.pdf on host: intra1.erlf.siemens.de:80
Status Code : 200
Reason : OK
Content-type : application/pdf
Persistent connection: not accepted
Reading the body of the response
2 - Connection closed (No persistent connection)
title: [... correct title of pdf document ...]
head: [... correct head of pdf document ...]
word: foo@0
...
word: bar@998
( http://intra1.erlf.siemens.de/test.pdf ignored)
size = 52650
pick: intra1.erlf.siemens.de, # servers = 1
> intra1.erlf.siemens.de supports HTTP persistent connections (infinite)
htdig: Run complete
The run only has to scan a single pdf file, test.pdf. The content of the pdf file is parsed correctly, and htdig also finds 998 words for the wordlist. At the end, however, htdig ignores the link. Why?
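For anyone who wants to repeat this check on their own verbose output, here is a minimal sketch in Python. It assumes the output has been saved as text with one message per line, and that the "word:" and "( ... ignored)" lines look like the ones in the excerpt. The function name summarize_verbose_log is hypothetical and is not part of htdig.

```python
import re

# Matches lines such as "( http://host/test.pdf ignored)" as seen in the excerpt.
# Assumption: the URL contains no whitespace.
IGNORED_RE = re.compile(r"\(\s*(\S+)\s+ignored\)")


def summarize_verbose_log(text):
    """Return (word_line_count, ignored_urls) for saved htdig verbose output."""
    word_lines = 0
    ignored = []
    for line in text.splitlines():
        stripped = line.strip()
        # Count every reported word entry, e.g. "word: foo@0".
        if stripped.startswith("word:"):
            word_lines += 1
        # Collect every URL that htdig reports as ignored.
        match = IGNORED_RE.search(stripped)
        if match:
            ignored.append(match.group(1))
    return word_lines, ignored


if __name__ == "__main__":
    sample = (
        "word: foo@0\n"
        "word: bar@998\n"
        "( http://intra1.erlf.siemens.de/test.pdf ignored)\n"
        "size = 52650\n"
    )
    print(summarize_verbose_log(sample))
```

A document that shows "word:" lines and also appears in the ignored list is exactly the combination described above.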
After scanning I get
- docdb
- docs.index
- excerpts
BUT NO words.db!
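The same check can be scripted. The following is only a sketch: it assumes all database files end up in one directory and that their names end with the names listed above (any common prefix is an assumption). The helper name missing_databases is hypothetical.

```python
from pathlib import Path

# Name endings taken from the files mentioned above; the exact on-disk
# names (for example a shared prefix) are an assumption.
EXPECTED_SUFFIXES = ["docdb", "docs.index", "excerpts", "words.db"]


def missing_databases(db_dir, suffixes=EXPECTED_SUFFIXES):
    """Return the expected name endings that no file in db_dir has."""
    names = [p.name for p in Path(db_dir).iterdir() if p.is_file()]
    return [s for s in suffixes if not any(n.endswith(s) for n in names)]
```

Calling missing_databases on the database directory after a run returns an empty list when every expected file exists, and a list containing "words.db" in the situation described here.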