Subject: Re: [htdig] premature merging
From: campbel@pc177.cisti.nrc.ca
Date: Fri Aug 11 2000 - 12:15:23 PDT


According to Geoff Hutchison:
> On Fri, 11 Aug 2000 campbel@pc177.cisti.nrc.ca wrote:
> > syntax in the config files so I know that it isn't that. I'm not sure
> > if it makes a difference but these start URLs all contain /cgi-bin/ and the
>
> I'd make sure you've set the exclude_urls appropriately. Remember that the
> default is to exclude cgi-bin.

My exclude_urls is set to .gif
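
To make the effect of that setting concrete, here is a rough Python sketch of the exclude check. It assumes exclude_urls is applied as a plain substring test against each space-separated pattern, and that the default value is "/cgi-bin/ .cgi". The helper name is_excluded is made up for illustration; this is not htdig's actual code.

```python
# Hypothetical helper, not htdig code: models exclude_urls as a plain
# substring test against each whitespace-separated pattern (assumption).
def is_excluded(url, exclude_urls):
    return any(pattern in url for pattern in exclude_urls.split())

# Example start URL taken from this message.
start = "http://www.foo.ca/cgi-bin/foo2/foo3/foo4/rp_tocs_e?bcb_bcb3-00_78"

# Assumed default value: it would reject every cgi-bin URL.
default_excluded = is_excluded(start, "/cgi-bin/ .cgi")

# The setting quoted above: only URLs containing ".gif" are rejected.
gif_only_excluded = is_excluded(start, ".gif")

print(default_excluded, gif_only_excluded)
```

If that assumption holds, overriding exclude_urls with .gif should be enough to let the /cgi-bin/ URLs through.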

> Also check limit_urls_to. By default, it takes on the value of start_url,
> which won't do if you list very specific URLs in this parameter, because
> your limit_urls_to won't be open-ended enough to allow other URLs.

As an example, all of the URLs in my start_url look similar to

http://www.foo.ca/cgi-bin/foo2/foo3/foo4/rp_tocs_e?bcb_bcb3-00_78

except that the part after the ? changes,

and each of those pages links to several URLs that look like

http://different.server.ca/cgi-bin/blah/blah/blah/ViewDoc?journal=one&volume=2&file=3.pdf

where the info after the ? changes.

My limit_urls_to attribute is set to
http://www.foo.ca/cgi-bin/foo2/foo3/foo4/rp_tocs_e? \
http://different.server.ca/cgi-bin/blah/blah/blah/RPViewDoc

so I can't see a problem with that. The strange thing is that htdig
goes through about 15 of the 50 start_url URLs and then starts merging.
It seems that htdig thinks it has finished digging, and I can't
pinpoint why.
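
One way to double-check the limit is to test the example URLs against the patterns by hand. The sketch below assumes limit_urls_to is a plain, case-sensitive substring match in which at least one pattern must occur in the URL. The helper name within_limits is made up for illustration and this is not htdig's code. The URLs are the placeholder ones from this message, so any mismatch it reports may only be a typo in the examples and should be checked against the real config.

```python
# Hypothetical helper, not htdig code: a URL passes if it contains at
# least one of the whitespace-separated limit_urls_to patterns (assumption).
def within_limits(url, limit_urls_to):
    return any(pattern in url for pattern in limit_urls_to.split())

# The two patterns as written in this message.
limit_urls_to = (
    "http://www.foo.ca/cgi-bin/foo2/foo3/foo4/rp_tocs_e? "
    "http://different.server.ca/cgi-bin/blah/blah/blah/RPViewDoc"
)

start_url = "http://www.foo.ca/cgi-bin/foo2/foo3/foo4/rp_tocs_e?bcb_bcb3-00_78"
linked_url = (
    "http://different.server.ca/cgi-bin/blah/blah/blah/"
    "ViewDoc?journal=one&volume=2&file=3.pdf"
)

start_ok = within_limits(start_url, limit_urls_to)
linked_ok = within_limits(linked_url, limit_urls_to)

print("start URL allowed: ", start_ok)
print("linked URL allowed:", linked_ok)
```

As written here, the linked example ends in ViewDoc while the pattern ends in RPViewDoc, so under a substring match the linked URL would not pass. If the real URLs and the real pattern differ in the same way, that would be worth ruling out first.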

> So one way to get more information on this
> is to run htdig by itself and add the -vvvv flag for more debugging
> information.

I ran the dig with -vvv and the output seemed fine: it was following
all links, indexing the PDFs, and parsing them perfectly.

I'm stumped,
Sheri
