RE: [htdig3-dev] Fetching outside of domain list (not supposed to)


Subject: RE: [htdig3-dev] Fetching outside of domain list (not supposed to)
From: Toxik - Dann Cohen (dann.cohen@toxik.com)
Date: Thu Jan 04 2001 - 09:54:57 PST


Hi Gilles,

If I set max_hop_count to 0, it will only fetch the first page of each site. I want it to fetch one page further, so max_hop_count needs to be 1. What's happening is that the fetch goes beyond the 1800 domains, when it's supposed to reject the domains that are not in start_url...

Any suggestions? By the way, it works fine when there are fewer domains, say 1500. Very strange...

Dann Cohen - Dir., Outsourcing and Information Systems
Toxik Technologies Inc. - Montreal, QC, Canada
www.toxik.com - Phone: (514) 528-6945 x 2 . Fax: (514) 221-3329

-----Original Message-----
From: Gilles Detillieux [mailto:grdetil@scrc.umanitoba.ca]
Sent: January 4, 2001 12:04
To: Toxik - Dann Cohen
Cc: htdig3-dev@htdig.org
Subject: Re: [htdig3-dev] Fetching outside of domain list (not supposed to)

According to Toxik - Dann Cohen:
> I'm a newcomer (6-month user of ht://dig) to this list and before
> saying anything I would like to say hi to everyone. Now to the good
> stuff =)
>
> I've encountered a problem with the fetching part. I have about 1800 sites
> in my "start_url" to fetch with a "max_hop_count" of 1 and it seems to
> go beyond the 1800.
>
> HTTP statistics
> ===============
> Persistent connections : Yes
> HEAD call before GET : No
> Connections opened : 14973
> Connections closed : 14973
> Changes of server : 6030
> HTTP Requests : 35357
> HTTP KBytes requested : 209216
> HTTP Average request time : 0.647679 secs
> HTTP Average speed : 9.13605 KBytes/secs
>
> As you can see, the value of "Changes of server" is higher than 1800. I can
> also see in the log that it goes beyond the domain (see below for an
> example): the domain is www.singapore-inc.com and you can see that a
> "mailto:" and "www.sedb.com.sg" are pushed in. The problem doesn't happen
> when I fetch them alone. Any suggestions or hints are welcome.

If you haven't already figured it out, you should be setting max_hop_count
to 0, not 1. One hop means it will attempt to follow all the valid links
in those initial 1800 documents.
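
To make the hop-count semantics concrete, here is a rough Python sketch of a hop-limited crawl over an in-memory link graph. It is not ht://Dig's actual code: the crawl() function, the LINKS table, the about.html and deep.html paths and the mailto address are all made up for illustration, and the rule that only hosts named in the start URLs are allowed is an assumption standing in for whatever URL restrictions the real configuration applies. Only the two host names come from the log mentioned above.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph standing in for real HTTP fetches.
START = ["http://www.singapore-inc.com/"]
LINKS = {
    "http://www.singapore-inc.com/": [
        "http://www.singapore-inc.com/about.html",
        "http://www.sedb.com.sg/",
        "mailto:info@singapore-inc.com",
    ],
    "http://www.singapore-inc.com/about.html": [
        "http://www.singapore-inc.com/deep.html",
    ],
}


def crawl(start_urls, links, max_hop_count, allowed_hosts=None):
    """Return {url: hops} for every URL reached within max_hop_count hops."""
    if allowed_hosts is None:
        # Assumption: only hosts that appear in the start URLs are allowed.
        allowed_hosts = {urlsplit(u).netloc for u in start_urls}
    seen = {u: 0 for u in start_urls}
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        hops = seen[url]
        if hops >= max_hop_count:
            continue  # documents at the hop limit are fetched, not followed
        for target in links.get(url, []):
            parts = urlsplit(target)
            if parts.scheme not in ("http", "https"):
                continue  # skip non-HTTP links such as mailto:
            if parts.netloc not in allowed_hosts:
                continue  # skip hosts outside the allowed set
            if target not in seen:
                seen[target] = hops + 1
                queue.append(target)
    return seen


print(sorted(crawl(START, LINKS, 0)))  # start document only
print(sorted(crawl(START, LINKS, 1)))  # start document plus its valid links
```

Under these assumptions, a hop count of 0 fetches only the start documents, while a hop count of 1 also fetches about.html but not deep.html, the mailto: link, or anything on www.sedb.com.sg.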

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930




This archive was generated by hypermail 2b28 : Thu Jan 04 2001 - 10:06:26 PST