Subject: RE: [htdig3-dev] Wordlist
From: Jost Diederichs (jost@qdusa.com)
Date: Sat Nov 25 2000 - 14:39:57 PST
Yes, that may be exactly the phenomenon I ran into. See my post
(Retriever.cc -...) from Thursday. The clue is the word "ignored" in your
htdig output. It is generated by a function in Retriever.cc. There is a
problem with the pointer dup. It is used like a variable, and if you check
your compiler output you will probably find a warning "the address of dup is
always true". I suppose that if you make the edit I described in my previous
post, everything will work fine. I have been trying to figure out where dup is
defined and what it means, but with no success so far and no answers from
the list.
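
To illustrate the kind of warning mentioned above, here is a minimal sketch of how a function name tested like a variable always evaluates to true. Everything in it is an assumption made for illustration: the namespace `sketch`, the stand-in function `dup`, and the helpers `is_ignored_buggy` and `is_ignored_fixed` are hypothetical and are not taken from Retriever.cc, whose actual declaration of dup is unknown here.

```cpp
#include <cassert>

namespace sketch {

// Hypothetical stand-in for some function that happens to be named `dup`.
// This is NOT the htdig declaration; it only reproduces the pattern.
int dup(int fd) { return fd; }

// Buggy pattern: no variable named `dup` is in scope, so the name refers
// to the function above. A function's address is never null, so the
// condition is always true (compilers typically warn about this).
bool is_ignored_buggy() {
    if (dup) {
        return true;
    }
    return false;
}

// Presumed intent (an assumption): test a real flag computed by the caller.
bool is_ignored_fixed(bool is_duplicate) {
    return is_duplicate;
}

// Small self-check showing the difference between the two variants.
void self_check() {
    assert(is_ignored_buggy());        // always true, whatever the input
    assert(!is_ignored_fixed(false));  // follows the real flag
    assert(is_ignored_fixed(true));
}

}  // namespace sketch
```

If the code in Retriever.cc follows this pattern, every document reaching that test would be reported as ignored, which would fit the output quoted below. That is only a guess until the real definition of dup is found.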
- Jost
-----Original Message-----
From: Geschke Steffen [mailto:Steffen.Geschke@erlf.siemens.de]
Sent: Saturday, November 25, 2000 2:21 PM
To: 'htdig3-dev@htdig.org'
Subject: [htdig3-dev] Wordlist
Hello,
I have just upgraded from 3.2.0b2 to 3.2.0b3 (Snapshot 11-19)
and ran into the following (new) problem:
The wordlist database is only created for documents of
mime type "text/html". Other mime types are indexed too,
but the words from these documents are not included
in the wordlist.
I didn't change the configuration file from 3.2.0b2 and AFAIK
I did not exclude mime types explicitly.
Here is a short excerpt of what htdig says in verbose mode when
I index a pdf file:
--
Making HTTP request on http://intra1.erlf.siemens.de/test.pdf ...
Header line: Content-Type: application/pdf
Retrieving document /test.pdf on host: intra1.erlf.siemens.de:80
Status Code : 200
Reason : OK
Content-type : application/pdf
Persistent connection: not accepted
Reading the body of the response
2 - Connection closed (No persistent connection)
title: [... correct title of pdf document ...]
head: [... correct head of pdf document ...]
word: foo@0
...
word: bar@998
( http://intra1.erlf.siemens.de/test.pdf ignored)
size = 52650
pick: intra1.erlf.siemens.de, # servers = 1
> intra1.erlf.siemens.de supports HTTP persistent connections (infinite)
htdig: Run complete
--
Only one pdf file, named test.pdf, needs to be scanned. The content of the pdf file is parsed correctly and htdig also finds 998 words for the wordlist. However, at the end htdig ignores the link. Why?
After scanning I get
- docdb
- docs.index
- excerpts
BUT NO words.db!
Any help?!
Steffen
------------------------------------ To unsubscribe from the htdig3-dev mailing list, send a message to htdig3-dev-unsubscribe@htdig.org You will receive a message to confirm this.
This archive was generated by hypermail 2b28 : Sat Nov 25 2000 - 14:48:22 PST