Re: [htdig] Problems with htdig 3.1.4


Subject: Re: [htdig] Problems with htdig 3.1.4
From: Geoff Hutchison (ghutchis@wso.williams.edu)
Date: Sun Jan 02 2000 - 11:03:46 PST


On Sun, 2 Jan 2000, Phillip Morgan wrote:

> I have recently installed htdig 3.1.4 and I find that it now indexes
> only 1300 of my 60,000+ documents that the old v2.xx version I was
> using indexed.

2.xx? Oh my, I'd bet Andrew's the only one who can remember back that far.
That certainly predates any documentation I have.

> and so on.. it only processes the first two. The first one of these has
> a directory containing over 60,000 documents. There is a valid trail
> leading from one doc to the next.. It used to work on the old version.

If you're up to sleuthing, you can turn on more debugging output, for
example by running "htdig -vvvv", though this will undoubtedly generate
*large* amounts of output. If you can find places where it does not follow
links correctly, we'll get on it. But first, read on.

> dummy reporting that this may be search spamming. Is this just a
> warning, and does it drop the doc from the index? How can I get rid of
> the warning/problem without removing the <TITLE> description (since the
> docs are automatically generated)?

It's a warning, nothing more. From what you say, this sounds like it
should be text, in which case I'd imagine you want &lt;TITLE&gt; instead.
(This is proper HTML nowadays.)
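
As a rough sketch of what I mean (in Python, just for illustration; the
description string is made up, not taken from your pages), the generator
would escape the text before writing it into the document:

```python
from html import escape

# Hypothetical generated description that mentions a tag name literally.
description = "Contents of the <TITLE> element"

# escape() replaces '<' and '>' (and '&') with HTML entities, so the
# result is displayed as text rather than parsed as a tag.
escaped = escape(description)
print(escaped)
```

Whatever language generates your docs, the point is the same: '<' and '>'
that are meant to be read as text should go out as entities.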

> Third, It seems to me, despite modifying the valid_punctuation and
> extra_word_character commands, that any file starting with # is ignored.
...
> For example, a file #dummy.zip lives at
> http://www.netbiz.net.au/SEARCH/#dummy.zip. Htdig says it cannot find
> http://www.netbiz.net.au/SEARCH.

If the URLs are literal, then you may have found a bug with the URL parser
and its treatment of anchors. But it's not what you think. The URL you
mention does not refer to a file named "#dummy.zip" but rather to a
subsection (anchor) of the document <http://www.netbiz.net.au/SEARCH/>.

I can't think what the proper URL-encoded form of '#' is, but I would
think %23 should be correct. (You can get this from making a FORM with
METHOD="GET".)

> I've tried as many variants of the configurations that I can think of,
> but I can't get it to index all the listed urls and all of the docs for
> each url. Can anyone offer some assistance?

I think Andrew is probably the only person who could guess what
differences you may have encountered. However, from what you've posted so
far, I bet many of your problems may be obsolete HTML or improperly coded
links. I took a short look at your website and it looked OK, but it's hard
to run through 60,000 URLs...

More example URLs would help as well.

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
