Re: [htdig] start_url and limit_url_to


Gabriel Fenteany (fenteany@calvin.bwh.harvard.edu)
Mon, 03 May 1999 21:47:48 -0400


 
>>Can I change the limit_urls_to another generic string that will get every
>>file on "http://foo3.com/" and sub-directories that is linked to
>>"http://foo3.com/foofile.html"?
>
> That's what I was trying to say before. Set:
>
> limit_urls_to: http://foo3.com/
>
> This will match everything on that server.
>

So this'll work even if they don't have an index file of any sort on that
server (or the relevant directory)? In other words, a more accurate example
of the problem is that someone has a URL
"http://foo3.com/foostuff/foofile.html" without an index file. If you went
there using a browser and typed in "http://foo3.com/foostuff/", you would
get an error.

The drawback, if that's true, is that it'll index even stuff that is not
meant to be linked to the "entry page," right?

Some of these are big gov't sites where they have a mix of documents related
to different "sites" in the same directory. I was hoping I could have the
indexer get only the files that were linked to a page of arbitrary filename,
rather than everything on the server or the defined directory.

For the example above, what they would want me to do is index everything
located in "http://foo3.com/foostuff/" BUT only if it is linked to
"foofile.html"
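
If I read the suggestion right, the config for this case would look something
like the sketch below. The two attribute names are htdig's; the values are
just my foo3.com example, and I'm assuming htdig only finds documents by
following links from start_url, with limit_urls_to acting as a filter on
which of those links it keeps:

    start_url:      http://foo3.com/foostuff/foofile.html
    limit_urls_to:  http://foo3.com/foostuff/

If that assumption holds, a file sitting in /foostuff/ that nothing reachable
from foofile.html links to would never be fetched, and the missing index
file wouldn't matter unless some page links to the bare directory.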

But for all but the most terrible sites I need to index, maybe the solution
you give will work. For the truly terrible ones, well, either they fall in
line or only their start_url gets indexed.

Thanks.

Gabriel

 


