[htdig] Duplicate URLs?


Jim Cole (greyleaf@yggdrasill.net)
Sun, 27 Jun 1999 16:37:32 -0600


Hi.

I have been trying to run htdig on a site that totals about 180 MB
counting everything in the entire directory tree, a significant portion
of which is not even linked through any URL. However, the combined size
of the word and document databases is ending up in the 500 to 600 MB
range, which is currently far too big for me to host anywhere. While
trying to understand why so much disk space was being used, I examined
the verbose output of htdig and noticed that the same HTML files were
scrolling by dozens of times, each time with a slightly different path.
A sample of the output is shown below.

21360:21360:4:http://www.somesite.org/index.html/queries/hosts/fyi/queries/hosts/hosts/fyi/whatsnew.htm:
-**********-******---------*-----***--- size = 17511
21361:21361:4:http://www.somesite.org/index.html/queries/hosts/fyi/queries/hosts/hosts/fyi/events.htm:
-**********-******---------*-----***--- size = 17511
..
21371:21371:3:http://www.somesite.org/index.html/queries/hosts/fyi/queries/hosts/hosts/hosts/whatsnew.htm:
-**********-******---------*-----***--- size = 17511
21372:21372:3:http://www.somesite.org/index.html/queries/hosts/fyi/queries/hosts/hosts/hosts/events.htm:
-**********-******---------*-----***--- size = 17511
..
21393:21393:8:http://www.somesite.org/index.html/queries/hosts/fyi/fyi/queries/queries/fyi/whatsnew.htm:
-**********-******---------*-----***--- size = 17511
21394:21394:8:http://www.somesite.org/index.html/queries/hosts/fyi/fyi/queries/queries/fyi/events.htm:
-**********-******---------*-----***--- size = 17511
..

Is this normal behavior? Might it have something to do with why my
databases are becoming so large?
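
One rough way to confirm that these entries really are repeats would be
to group the logged URLs by file name and reported size. The sketch
below is Python and assumes the two-line layout shown in the sample (an
"id:id:number:URL:" line followed by a progress line ending in
"size = N"). The function name and the regular expressions are my own
and are not part of htdig.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the sample output: an "id:id:number:URL:" line
# followed by a progress line that ends in "size = N".
URL_LINE = re.compile(r"^\d+:\d+:\d+:(\S+?):?\s*$")
SIZE_LINE = re.compile(r"size = (\d+)\s*$")


def group_duplicates(lines):
    """Group URLs by (file name, size); keep only groups with more than one URL."""
    groups = defaultdict(list)
    current_url = None
    for line in lines:
        m = URL_LINE.match(line)
        if m:
            # Remember the URL until its matching "size =" line arrives.
            current_url = m.group(1)
            continue
        m = SIZE_LINE.search(line)
        if m and current_url is not None:
            # The last path component is the file name, e.g. "whatsnew.htm".
            name = current_url.rsplit("/", 1)[-1]
            groups[(name, int(m.group(1)))].append(current_url)
            current_url = None
    return {key: urls for key, urls in groups.items() if len(urls) > 1}
```

Feeding the lines of a saved verbose run to group_duplicates() would
return, for each file name and size seen more than once, every path
under which it was fetched. Matching name and size only hints that the
documents are the same; it does not prove it.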

Except for changing start_url, database_dir, and maintainer, I am using
the default config file.

Any help would be appreciated.

Thanks.

Jim Cole


