[htdig] How to include external list to start_url:

Gabriel Fenteany (fenteany@calvin.bwh.harvard.edu)
Sat, 01 May 1999 15:38:02 -0400

Hello. Thanks especially to Geoff and Gilles for helping me sort out the
small problems I was getting hammered for by some of the people who maintain
the sites I'm indexing.

One complaint I keep getting comes from people whose site starts not with
an index page but with some dumb filename as the entry page (so you have to
include "dumbfilename.html" in the URL in start_url). They can't grasp why
htdig is not indexing their local linked pages. I keep asking them either to
rename the entry page to the proper index page for their Web server's
configuration, so that I can put a "/" at the end of the domain for their
site and have the local linked files indexed, or to e-mail me a list of all
the darn URLs they want indexed. Strangely, rather than renaming
dumbfilename.html, some of them actually prefer to send me the URLs
separately, which I then have to add to start_url. I don't understand some
people, but anyway... I assume there is currently no other easy solution to
this conundrum, right? The robot, after all, cannot divine the intent of
people who write awful sites and pages, and there's no easy way to teach it
that, is there?
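
To make the two cases concrete, here is roughly what the alternatives look
like in the config. The example.com hostname is only a placeholder, and I'm
assuming limit_urls_to is left at its default of ${start_url}, which as far
as I understand is what restricts the dig in the first case:

```
# Case 1: the entry page is a non-index file (placeholder hostname).
# Assumption: limit_urls_to defaults to ${start_url}, so only URLs that
# contain this full string are followed and sibling pages are skipped.
start_url: http://www.example.com/dumbfilename.html

# Case 2: the entry page is the server's index page.
# Every URL under the site contains this string, so locally linked
# pages are indexed as well.
#start_url: http://www.example.com/
```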

I read that you can set start_url to read an external file listing all the
sites to index, but I couldn't find out how to do it. I'd really appreciate
it if you could tell me. Also, should the file be a plain .txt file with
tab- or space-delimited URLs? Is there anything else that needs to go in the
external URL list file? And does the file have to be on the local server, or
can htdig fetch it over HTTP from elsewhere? (That may be useful in the
future, though for now it's going to be local.)

Finally, can you use an external file for limit_urls_to as well?

Thanks a million!

Gabriel Fenteany

