htdig: HTDig questions


Jacques Le Mouel (lemouel@sprintmail.com)
Wed, 16 Sep 1998 16:52:20 -0400


First, thanks and congratulations to everyone involved for such a great tool.

Next, I have many questions regarding HTDig.
I have recently set up HTDig 3.0.8b2 on an HP-UX 9000/800 G30 running
HP-UX 10.01 and Netscape Enterprise Server 2.0.
The goal is to provide a search facility for an intranet site of a few
thousand PDF documents.

The questions are (in no particular order, as they say):
- Is there a freely available list of English synonyms? We would be
most interested in one focusing on telecom industry jargon.
- Is there a way to set up a list of "anti-bad words", i.e. to force the
inclusion of a word whenever it appears in a document, even if it is
shorter than the length threshold? For instance, if we cut at 3 letters
(i.e. "do not index words shorter than 3 letters"), we would still want
to pick up all instances of AT, as it can be meaningful to us.
- If I run htdig -i and later change the .conf file, the changes don't
seem to be taken into consideration (especially start page, excludes...).
Am I doing something wrong?
- Is there an easy way to extract the list of all the documents in the
library? Conceptually, it could be done by searching with the Words
field empty, but this fails.
- Is it possible to specify at search time whether to use endings or not?
- Is it possible to search for phrases?
- What about the new version with DB2 instead of GDBM? Why the change? Is
it quicker? It seems it is still in beta; when is it supposed to be
released in final form? Considering that it requires rebuilding the index
from scratch, and that this takes us a very long time (16 hours on
our server, for 160MB of PDF; the really slow part is the "acroread
-toPostScript" for all the files), is it worth moving to this version?
- Is there a way to dig several documents in parallel (i.e. convert and
read through several PDFs at the same time)? Would this speed up the
indexing process?
- Is it possible to have multiple restricts at search time, for example:
restrict to URLs that include both /myserver/docs/subject1 AND .pdf?
- Is it possible to index multiple servers? Does this require multiple
.conf files or can it be done using only one .conf file?
- Are there tools to "massage" the database once it is created, for
instance to remove some docs from it, in order to avoid a complete
rebuild? (Think of my 16 hours, and we are barely halfway through
loading the site...)
- Are bad words excluded at dig/merge time, or at search time (which
would increase the size of the database for nothing)?
- Is proximity searching supported?
- Can HTDig support hit highlighting within PDF documents by using byte
serving and (is it) XML (?)?
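
To make the "anti-bad words" request above concrete, here is a minimal
sketch in Python of the behaviour I have in mind. It is not HTDig code and
does not use any HTDig API; the names MIN_WORD_LENGTH, ALWAYS_INDEX and
indexable_words are hypothetical, and matching the exception list
case-sensitively (so that AT is kept but the preposition "at" is not) is
my own assumption.

```python
import re

# Hypothetical settings, for illustration only.
MIN_WORD_LENGTH = 3      # "do not index words shorter than 3 letters"
ALWAYS_INDEX = {"AT"}    # short words to keep anyway (case-sensitive)


def indexable_words(text, min_length=MIN_WORD_LENGTH, always_index=ALWAYS_INDEX):
    """Return the words of `text` that would be indexed.

    A word is kept if it is at least `min_length` characters long, or if
    it appears verbatim in `always_index`. Ordinary words are lowercased;
    exception words keep their original spelling.
    """
    kept = []
    for word in re.findall(r"[A-Za-z0-9]+", text):
        if word in always_index:
            kept.append(word)           # forced inclusion despite its length
        elif len(word) >= min_length:
            kept.append(word.lower())   # normal case: long enough to index
    return kept


print(indexable_words("Call AT support at 9 am"))
```

With these settings, "AT" survives the length cut while "at", "9" and "am"
are dropped.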

OK, maybe that is enough questions for now.
Any help would be greatly appreciated.

Jacques Le Mouel


