[htdig] Problems with htdig 3.1.4

Subject: [htdig] Problems with htdig 3.1.4
From: Phillip Morgan (admin@netbiz.net.au)
Date: Sat Jan 01 2000 - 20:11:12 PST

Hi fellow time travellers,

I have recently installed htdig 3.1.4 and I find that it now indexes
only 1300 of the 60,000+ documents that the old v2.xx version I was
using indexed.

I have several urls like so...

        http://www.netbiz.net.au \
        http://www.ow.com.au \

and so on, but it only processes the first two. The first of these has
a directory containing over 60,000 documents. There is a valid trail
leading from one doc to the next. It used to work on the old version.

Second, the descriptions of some documents contain the word <TITLE>
(not the official title used for the HTML doc), and htdig spits the
dummy reporting that this may be search spamming. Is this just a
warning, and does it drop the doc from the index? How can I get rid of
the warning/problem without removing the <TITLE> description (since the
docs are automatically generated)?

Third, it seems to me, despite modifying the valid_punctuation and
extra_word_characters attributes, that any file starting with # is
ignored. In fact, it appears to throw htdig into a frenzy. What it does
is report
that the entire directory cannot be found, after about a 30 second

For example, a file #dummy.zip lives at
http://www.netbiz.net.au/SEARCH/#dummy.zip. Htdig says it cannot find
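
One possible explanation, offered as an assumption rather than a
confirmed diagnosis of htdig's behaviour: in URL syntax, '#' introduces
a fragment identifier, so a client that follows that syntax would
request only /SEARCH/ and treat dummy.zip as a fragment. A minimal
Python sketch using the standard urllib.parse module shows the split,
and the percent-encoded form (%23) that keeps the '#' inside the path.
Whether htdig 3.1.4 accepts the encoded form has not been verified here.

```python
from urllib.parse import quote, urlsplit

# URL exactly as written in the post; '#' is read as the fragment delimiter.
raw = "http://www.netbiz.net.au/SEARCH/#dummy.zip"
parts = urlsplit(raw)
print(parts.path)      # /SEARCH/
print(parts.fragment)  # dummy.zip

# Percent-encode the file name so '#' becomes %23 and stays part of the path.
encoded = "http://www.netbiz.net.au/SEARCH/" + quote("#dummy.zip", safe="")
print(encoded)                 # http://www.netbiz.net.au/SEARCH/%23dummy.zip
print(urlsplit(encoded).path)  # /SEARCH/%23dummy.zip
```

If the links in the generated pages are written with %23 instead of a
literal '#', the file name should at least survive URL parsing intact.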

I've tried as many variants of the configurations that I can think of,
but I can't get it to index all the listed urls and all of the docs for
each url. Can anyone offer some assistance?

btw: The system is Slackware 4.0 Linux (kernel 2.2.6), 192 MB RAM,
30 GB disk etc.



NetBiz Internet Services    | ICQ: 12796450
P.O. Box 449, Croydon 3136  | FTN: 3:633/252
Email: admin@netbiz.net.au  | Vox: +61 3 9876 5295

