RE: [htdig] puzzled by htdig


Subject: RE: [htdig] puzzled by htdig
From: GYGAX,OTTO (HP-Corvallis,ex1) (otto_gygax@hp.com)
Date: Thu Oct 05 2000 - 14:01:47 PDT


Thanks, Geoff, for getting back to me.

My limit_urls_to key is set as you have it below (default).
My start_url is currently set to a list of URLs such as http://>/,
http://>/arch.html, http://>/dir1, http://>/dir2, http://>/dir3, ...
where arch.html is a simple web page with an href pointing to
http://>/~arch, the cover page of the MHonArc mailing tree
that contains links to every single mailing archive page.

Before I extended the start_url attribute, I only had http://>/ and
http://>/arch.html, but htdig went only as far as the few links off the
server's index.html file, missing all other directories at the root. At one
point it somehow managed to follow the link in arch.html, but that has since
stopped working, which is what I'm trying to resolve now.

By including all the other directories at the root I'm now getting a more
exhaustive database of search items as originally intended, but the one I really
need (~arch and the tree it points to) is still missing.

        -otto

----------------------------------------------
Otto A. Gygax (Otto_Gygax@hp.com)
Digital Publishing Solutions, Software Development
Hewlett-Packard, Corvallis, Oregon
ph: (541)715-9098 / fax: (541)715-4980 / cell: (541)602-3491

-----Original Message-----
From: Geoff Hutchison [mailto:ghutchis@wso.williams.edu]
Sent: Wednesday, October 04, 2000 7:30 PM
To: GYGAX,OTTO (HP-Corvallis,ex1)
Cc: 'htdig@htdig.org'
Subject: Re: [htdig] puzzled by htdig

At 2:43 PM -0700 10/4/00, GYGAX,OTTO (HP-Corvallis,ex1) wrote:
>Now it won't work. htdig is able to look up other web pages that reside at the
>root of the web server but cannot traverse down to the ~arch tree.

There are a few points here and it is perhaps better to explain how
htdig follows links rather than to directly address your question.

In the htdig.conf file, there are two key attributes for your question:
start_url: http://www.foo.com/
limit_urls_to: ${start_url}

As set, this would start indexing at www.foo.com and go from there.
The limit_urls_to attribute requires that any URL htdig finds match
this pattern. In this case, this will limit indexing to everything
inside this server. (You could, for example, just set it to "foo.com"
to index all servers in that domain, etc.) But it will *only* follow
links. So if you don't have a link from a file at the start_url to a
certain file, it won't index it.
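
The link-following behavior described above can be illustrated with a
rough sketch. This is not htdig's actual code: the crawl() function, the
in-memory "site" dictionary, and the page names are all hypothetical, and
it assumes limit_urls_to is applied as a simple substring test. It only
shows why a page with no incoming link from the start_url is never indexed.

```python
from collections import deque

def crawl(start_urls, limit_urls_to, links):
    """Return the set of URLs reached by following links from start_urls.

    links is a dict mapping each URL to the URLs it links to; it stands in
    for fetching and parsing real pages (an assumption of this sketch).
    limit_urls_to is a list of strings; a URL is kept only if it contains
    at least one of them (assumed substring matching).
    """
    def allowed(url):
        return any(pattern in url for pattern in limit_urls_to)

    seen = set()
    queue = deque(u for u in start_urls if allowed(u))
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        # Only URLs that appear as links on an already visited page are queued.
        for target in links.get(url, []):
            if allowed(target) and target not in seen:
                queue.append(target)
    return seen

# Hypothetical site: orphan.html exists but nothing links to it.
site = {
    "http://www.foo.com/": ["http://www.foo.com/a.html", "http://www.bar.org/"],
    "http://www.foo.com/a.html": ["http://www.foo.com/b.html"],
    "http://www.foo.com/orphan.html": [],
}

indexed = crawl(["http://www.foo.com/"], ["http://www.foo.com/"], site)
for url in sorted(indexed):
    print(url)
```

In this sketch, orphan.html is skipped because no visited page links to
it, and www.bar.org is skipped because it does not match limit_urls_to.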

Your example is a little unclear to me. My guess is that you are
either not using limit_urls_to correctly or you don't have working
links to the files you're trying to index.

For more information:
http://www.htdig.org/attrs.html#start_url
http://www.htdig.org/attrs.html#limit_urls_to

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/

------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig-unsubscribe@htdig.org
You will receive a message to confirm this.
List archives: <http://www.htdig.org/mail/menu.html>
FAQ: <http://www.htdig.org/FAQ.html>



This archive was generated by hypermail 2b28 : Thu Oct 05 2000 - 14:05:00 PDT