Jim Cole (email@example.com)
Wed, 30 Jun 1999 23:54:39 -0600
I have been having a problem getting htdig to build a reasonable
database for a particular site. Specifically, the combined database
sizes were ending up on the order of 3 to 4 times larger than the entire
site. I believe I found the cause of this problem, and while not
technically a problem with htdig, I thought I would pass the information
on in the hope that it will save someone else a week of building broken
databases and reading debug output :)
While examining htdig's output using the -vvv option, I discovered that
htdig was creating a lot of broken GET requests. Toward the end, they
were looking something like...
From the server's point of view, everything in the URL after the
index.html is garbage, and the same page (index.html) is returned over
and over with all relative links resulting in new unique URLs that again
result in htdig grabbing the same index.html file.
As far as I can tell, the meltdown originates with a small syntax error
in one user's page. This user had a link that looks like...
This then resolved to a new, unique URL of
http://www.########.org/index.html/. So htdig went ahead and processed
it as such. When relative links were found in the index.html file, new
URLs were generated, such as
When htdig did the GET on this specific URL, the server of course
returned index.html instead of qpost.html, but treated relative links in
index.html as if they were relative to
http://www.########.org/index.html/queries, which generated URLs like
The process continued, generating longer and longer bogus URLs. I am not
sure what finally broke the cycle.
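
The runaway growth described above can be reproduced with a rough sketch
using Python's standard urljoin. This is not htdig's own code. The host
www.example.org stands in for the masked site name, the relative link
queries/qpost.html is assumed from the paths mentioned above, and
simulate_crawl is a made-up helper name. It assumes the server returns
the same index.html for every URL under index.html/.

```python
from urllib.parse import urljoin


def simulate_crawl(start_url, relative_link, rounds):
    """Resolve the same relative link again and again.

    This imitates a crawler that receives the same page (containing
    the same relative link) no matter which URL it requests.
    """
    urls = [start_url]
    for _ in range(rounds):
        # The page just fetched becomes the base for the next link.
        urls.append(urljoin(urls[-1], relative_link))
    return urls


# Hypothetical start URL: index.html followed by a stray trailing slash.
urls = simulate_crawl(
    "http://www.example.org/index.html/", "queries/qpost.html", 3
)
for u in urls:
    print(u)
```

Each round yields a URL the crawler has not seen before, with one more
queries/ segment than the last, so a plain "already visited" check never
stops the loop.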
I am in the process of trying to crawl the site again with .html/ and
.htm/ added to the exclude_urls attribute. On the off chance that this
doesn't work, does anyone have other ideas about how to avoid this
problem? Well, short of validating thousands of pages contributed by
dozens of people? ;)
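
For anyone who wants to check a URL list against those two patterns
before starting a long dig, here is a small sketch. It assumes
exclude_urls rejects any URL that contains one of the listed patterns as
a plain substring. The is_excluded helper and the example.org URLs are
made up for illustration and are not part of htdig.

```python
# Patterns proposed above for the exclude_urls attribute.
EXCLUDE_PATTERNS = [".html/", ".htm/"]


def is_excluded(url, patterns=EXCLUDE_PATTERNS):
    """Return True if the URL contains any pattern as a substring."""
    return any(pattern in url for pattern in patterns)


# Hypothetical URLs: two good ones, two of the bogus kind.
samples = [
    "http://www.example.org/index.html",
    "http://www.example.org/queries/qpost.html",
    "http://www.example.org/index.html/queries/qpost.html",
    "http://www.example.org/index.htm/queries/qpost.html",
]
kept = [u for u in samples if not is_excluded(u)]
for u in kept:
    print(u)
```

Both patterns are needed because .htm/ does not occur inside .html/, so
neither one covers the other.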