Re: [htdig] Suse 6.2 + htdig 3.1.5


Subject: Re: [htdig] Suse 6.2 + htdig 3.1.5
From: Geoff Hutchison (ghutchis@wso.williams.edu)
Date: Tue May 02 2000 - 14:20:48 PDT


At 11:19 PM +0300 5/2/00, Peter L. Peres wrote:
>the parent directory in any index. Since some directories which I index
>are not under the html root, the htdig actually tried to climb the whole
>directory tree up to '/' (a couple of GBs of disks) !!!

Aha. Good to hear that you figured it out. I'm still curious about the
"duplicate" issue though--do you actually see duplicate URLs?

>How can I do this ? (I will eventually hack it into the source - later). A
>pointer to the relevant source file/idea will be welcome. What is needed,

No need to hack the source. Try something like this:

start_url: http://localhost/docs/foo/
limit_urls_to: /docs/foo/ /docs/bar/ [etc]

>i.e. if page /a/b/c/d is indexed, then if it contains any hrefs:
>/a/b/c, /a/b or /a, they are to be ignored. However, /a/b/f should not be
>ignored, nor /a/b/c/e etc.

Sure. You just want to set the limit_urls_to and exclude_urls as
appropriate. Both accept multiple strings. If a URL matches any
pattern in limit_urls_to and DOESN'T match a pattern in exclude_urls,
it will be added to the "TODO" list.
See <http://www.htdig.org/attrs.html#limit_urls_to> and
<http://www.htdig.org/attrs.html#exclude_urls>
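
As a rough illustration of that rule (this is not htdig's actual code),
the decision can be sketched in Python. The function name should_queue
and the /docs/bar/private/ exclude pattern are made up for the example,
and plain substring matching is an assumption; the attribute docs above
describe the exact matching behaviour.

```python
def should_queue(url, limit_urls_to, exclude_urls):
    """Return True if url would be added to the TODO list under the rule above."""
    # Assumption: a pattern "matches" when it occurs as a substring of the URL.
    in_limits = any(pattern in url for pattern in limit_urls_to)
    excluded = any(pattern in url for pattern in exclude_urls)
    return in_limits and not excluded


limits = ["/docs/foo/", "/docs/bar/"]
excludes = ["/docs/bar/private/"]  # hypothetical exclude pattern

for candidate in (
    "http://localhost/docs/foo/index.html",
    "http://localhost/docs/bar/private/notes.html",
    "http://localhost/",
):
    print(candidate, should_queue(candidate, limits, excludes))
```

Only the first candidate passes both checks: the second is caught by
the exclude pattern, and the third matches nothing in the limits.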

Relative URLs are resolved to absolute URLs before they are compared
with the limits. So a URL of '/' would become http://localhost/ and
would be ignored. Similarly, a URL of '../' would become
http://localhost/docs/ and would be ignored. Neither one matches any
pattern in limit_urls_to.
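
To see why those two links fall outside the limits, here is a small
Python sketch using the standard urljoin function. It only mimics the
behaviour described above; the helper name resolve_and_check is
hypothetical, and substring matching against the limits is an
assumption.

```python
from urllib.parse import urljoin

BASE = "http://localhost/docs/foo/"            # the page being indexed
LIMIT_URLS_TO = ["/docs/foo/", "/docs/bar/"]   # patterns from the example config


def resolve_and_check(base, href, limits=LIMIT_URLS_TO):
    """Resolve href against base, then test the result against the limits."""
    absolute = urljoin(base, href)
    # Assumption: a limit matches when it is a substring of the absolute URL.
    return absolute, any(pattern in absolute for pattern in limits)


for href in ["/", "../", "sub/page.html", "../bar/"]:
    print(href, resolve_and_check(BASE, href))
```

The first two hrefs resolve to URLs above /docs/foo/ and match neither
limit, while the last two stay inside the allowed directories.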

If you wanted to restrict indexing to the /docs/ directory but wanted
to exclude a specific directory (like Java), you could do:

start_url: http://localhost/docs/
limit_urls_to: /docs/
exclude_urls: /docs/java/

>The way things are now, if one would index a page on geocities, f.ex., one
>would index the whole geocities, since each geocities page contains a

No, this isn't true. See above. Actually, the default for
limit_urls_to is:

limit_urls_to: ${start_url}

So if you only wanted to index everything in the htdig.org mailing
list archives, you'd just set:
start_url: http://www.htdig.org/mail/

Every URL that doesn't match this prefix will be ignored.
Effectively, you've limited the indexing to only that directory and
its subdirectories (i.e. each year and month).

My guess is that your limit_urls_to is a bit liberal. As I asked
earlier, what does your config file look like?

Cheers,

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/



