Re: htdig: HTDig questions

Geoff Hutchison
Wed, 16 Sep 1998 22:43:58 -0400

>- Is there a freely available list of synonyms, in English; we would be
>most interested by one focusing on the Telecom industry jargon

Well, a new list of synonyms has been compiled by John Banbury.
If you'd like a copy, I can mail it to the list or
put it up somewhere.

>- Is there a way to set-up a list of "anti-bad words", i.e. force the
>index words smaller than 3 letters"), still pick-up all instances of AT
>as it can be meaningful to us

Hmm. Right now the best way to do this is to change minimum_word_length to 2
and list all the non-useful short words in the bad_words dictionary, leaving
out the ones you want indexed. I'm working on a patch to allow factoring in
the word frequency, so words like "the" wouldn't need to be explicitly in
bad_words since they'd have such a high frequency.
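
As a rough sketch of the settings described above, the conf file might
contain something like the following. The bad_words path shown is an
assumption (I believe it matches the usual default, but check your own
installation), and which short words you add to that file is up to you.

```
# Index words of two or more letters instead of three or more.
minimum_word_length: 2

# File listing words that should never be indexed. Add the short words
# you do not want ("of", "to", "in", ...) and leave out the ones that
# are meaningful to you, such as "at".
bad_word_list: ${common_dir}/bad_words
```

The index has to be rebuilt after changing either setting, since both
apply when documents are dug, not when they are searched.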

>- Is there an easy way to extract the list of all the documents in the
>library; conceptually, it could be done by searching with the Words
>field empty, but this fails

Yup. Run htdig with the -t option ("Create an ASCII version of the document
database"). Of course, the output won't be just a list of the documents...

>- Is it possible to specify at search time whether to use endings or not

You can have a "conf" field with a pop-up menu. Then one conf file can have
the endings and another won't. Otherwise the conf files would be identical.
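
A minimal sketch of the two-file setup, assuming the fuzzy algorithms are
selected with the search_algorithm attribute. The file names and the
weights are made up for illustration, and everything else in the two
files would be the same.

```
# with_endings.conf (hypothetical name): exact matches plus word endings
search_algorithm: exact:1 endings:0.1

# no_endings.conf (hypothetical name): exact matches only
search_algorithm: exact:1
```

The pop-up menu on the search form then only has to pass the name of the
chosen conf file to htsearch.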

>- Is it possible to search for phrases

Not yet.

>- What about the new version with DB2 instead of GDB? Why the change? Is
>it quicker? It seems it is still in Beta; when is it supposed to be
>released in final form?

DB2 is faster, for one. The htmerge program is usually 2-3 times faster for
me and searches are faster as well. As for "beta," well... The version most
people had recently was 3.0.8b2 which was "beta." The new version is also
"beta" since I haven't had a chance to test it on a lot of platforms. It
should be more stable than 3.0.8b2. As for a "final release," that depends
on how many bugs are found... :-)

>- Is there a way to dig several documents at the same time in parallel
>(i.e. convert and read through several PDFs at the same time)? Would
>this speed-up the indexing process?

This cannot be done at the moment. It might speed up the indexing depending
on the speed of your CPU and all that.

>- Is it possible to have multiple restricts at search time, like:
>restrict to URL that include both /myserver/docs/subject1 AND .pdf

I believe this should work: <input name="restrict"

>- Is it possible to index multiple servers? Does this require multiple
>.conf files or can it be done using only one .conf file?

No, you can use one conf file. Just change the limit_urls_to conf option.
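
For example, a single conf file covering two servers might look roughly
like this. The host names are placeholders, not real servers.

```
# Places to start digging, one URL per server (placeholder hosts).
start_url:      http://www.example.com/ http://docs.example.com/

# Only follow links whose URLs match one of these patterns. Reusing
# start_url keeps the dig confined to the servers listed above.
limit_urls_to:  ${start_url}
```

Any URL that does not match one of the limit_urls_to patterns is skipped,
so each extra server needs an entry there as well as a starting URL.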

>- Are there tools to "massage" the database once it is created; for
>instance, to remove some docs from it, ... in order to avoid a complete
>rebuild (think of my 16 hours, and we are barely half way through
>loading the site...)

No idea. It wouldn't be hard to write them, but I don't think they exist at
the moment. There are a few programs in contrib/ that may do similar things.

>- Are bad words excluded at the dig/merge time, or at search time (which
>would increase the size of the database for nothing)
>- Is Proximity searching supported?

See phrase searching. If we had proximity (near) searching, wouldn't we
have phrase searching too? :-)

>- Can Htdig support hit hiliting within PDF documents by using byte
>serving and (is it) XML (?)?

-Geoff Hutchison
Williams Students Online


This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:27:48 PST