Re: [htdig] Irrelevant sites visited by htdig when including site list in

Subject: Re: [htdig] Irrelevant sites visited by htdig when including site list in
From: Gilles Detillieux
Date: Tue Feb 01 2000 - 09:03:15 PST

According to Paul Cauberg:
> I've been trying to include a file which contains a list (1000) of sites
> which htdig must visit. I've set limit-url to the same file (I've also
> tried setting it to $start_url); in both cases, for some reason, htdig
> starts visiting lots of other sites. When I just put the 1000 sites in
> the config file (which is a terrible job, because hardly any editor
> supports so much text on one line), everything works fine.

You never need to put everything on one line. Within a config file,
you can break up a line by ending with a backslash (\) and continuing
on the next line. E.g.:

bad_extensions: .wav .gz .z .sit .au .zip \
                .tar .hqx .exe .com .gif \
                .jpg .jpeg .aiff .class .map .ram \
                .tgz .bin .rpm .mpg .mov .avi
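
To make the continuation rule concrete, here is a minimal Python sketch of
how backslash-continued lines could be folded into one logical line. It is
only an illustration, not htdig's actual parser: the function name
join_continuations is hypothetical, and the exact whitespace handling is an
assumption.

```python
def join_continuations(text):
    """Fold lines ending in a backslash into the line that follows.

    Illustration only: whitespace handling here is an assumption,
    not a description of htdig's own config parser.
    """
    logical = []   # completed logical lines
    pending = ""   # text accumulated from continued lines
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.endswith("\\"):
            # Drop the backslash and keep reading on the next line.
            pending += line[:-1].rstrip() + " "
        else:
            logical.append(pending + line.strip())
            pending = ""
    if pending:
        # The last line ended with a backslash; keep what was collected.
        logical.append(pending.rstrip())
    return logical
```

Fed the bad_extensions example above, this would return a single logical
line holding the attribute name and all of its extensions.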

Within files used in the file expansion of attributes (`file`) you can
break up lines wherever you want - all white space is treated as equivalent
in these files.

> The relevant
> part of my config-file looks like this:
> start_url: `/home/limburg/sites.htdig`
> limit_urls_to: `/home/limburg/sites.htdig`

I can see no good reason why this would not work. Ideally, in
sites.htdig, you'd have one URL per line, but this is not essential.
You can use spaces, tabs, newlines or carriage-returns to separate
URLs. Note that you cannot include comments in this file. There is
a small bug in the handling of this feature right now, though, in that
it will not properly handle lines longer than 1000 characters.
Lines longer than this in the included file will be broken up into 1000
character pieces, so if you have a very long URL in there, it could get
split in two.
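
One way to stay clear of the long-line problem is to rewrite the included
file with one URL per line before running htdig. The following Python
sketch does that and flags any single URL that would still be too long.
The function names and the MAX_LINE constant are hypothetical, and treating
a URL of exactly 1000 characters as too long is a deliberately conservative
assumption.

```python
MAX_LINE = 1000  # line length limit described above


def normalize_url_list(text):
    """Return the URLs in text, one per line.

    Any run of spaces, tabs, newlines or carriage returns is treated
    as a separator, matching the description of included files above.
    """
    urls = text.split()
    return "\n".join(urls) + ("\n" if urls else "")


def overlong_urls(text, limit=MAX_LINE):
    """List URLs that could still be split even on a line of their own.

    Conservative assumption: anything of length >= limit is flagged.
    """
    return [url for url in text.split() if len(url) >= limit]
```

You would read sites.htdig, write the result of normalize_url_list back
out, and check by hand anything that overlong_urls reports.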

Also be aware that limit_urls_to will allow any URL that matches one of
the patterns provided, so it does not necessarily limit the dig to the
listed URLs alone. E.g., if you have a URL like "" in your
sites.htdig, it will not only dig that one directory index, but will
allow any links to URLs under that directory as well.
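
The effect can be sketched in a few lines of Python. This is a rough
model only: it assumes the patterns are applied as plain substring
matches, which is my simplification rather than a statement of htdig's
exact matching code, and the function name and the example.com and
example.org URLs are made up for illustration.

```python
def is_allowed(url, patterns):
    """Return True if url matches at least one limit pattern.

    Assumption: a pattern matches when it occurs anywhere in the URL.
    """
    return any(pattern in url for pattern in patterns)


# Hypothetical pattern list, as it might be read from sites.htdig.
patterns = ["http://www.example.com/docs/"]

# The directory index itself is allowed.
index_ok = is_allowed("http://www.example.com/docs/", patterns)

# So is anything linked underneath that directory.
deeper_ok = is_allowed("http://www.example.com/docs/a/b.html", patterns)

# A URL on another site does not match any pattern.
other_ok = is_allowed("http://www.example.org/", patterns)
```

Under this model, a short pattern such as a bare site URL admits every
page on that site, so the pattern list is only as restrictive as its
entries are specific.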

Gilles R. Detillieux              E-mail: <>
Spinal Cord Research Centre       WWW:
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

