Re: [htdig] accessing sites whose entry pages are not index.html


Torsten Neuer (tneuer@inwise.de)
Thu, 29 Apr 1999 13:14:15 +0200


According to Gabriel Fenteany:
>Hi. I am indexing a large number of different servers. Some of the URLs
>point to http://foo.com/ but apparently the index file is not index.html but
>index.htm or default.html. Will htdig dig a site right if http://foo.com/
>uses "index.htm" and not "index.html"? If it would NOT dig the site with
>the more standard index filename, what is the switch I'd use in the
>htdig.conf? Point is, I don't want to have to check what the name of the
>entry page of all these kinds of sites are.

First of all it depends upon the configuration of the HTTP server that
runs the web site you intend to index. If the server is unable to answer
a request to "http://www.foo.com/" by returning the correct index document,
blame it on the webmaster or sysadmin of that site.
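
To make that concrete, here is a rough sketch (Python, purely for
illustration) of what a robot effectively sends for the bare URL and what it
looks at in the reply. The helper names root_request, fetch_root and
summarize_response are made up for this example, and the parser assumes a
well-formed response with CRLF line endings:

```python
import socket


def root_request(host: str) -> bytes:
    """Build the minimal HTTP/1.0 request for http://host/ (no file name)."""
    return f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")


def fetch_root(host: str, port: int = 80, timeout: float = 10.0) -> bytes:
    """Send the request and read until the server closes the connection."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(root_request(host))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)


def summarize_response(raw: bytes) -> dict:
    """Pull the status code and a few headers out of a raw HTTP response."""
    head, _, _ = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status = int(lines[0].split()[1])  # "HTTP/1.0 200 OK" -> 200
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {
        "status": status,
        "content_type": headers.get("content-type"),
        "location": headers.get("location"),
    }
```

Calling summarize_response(fetch_root("www.foo.com")) should show a 200 with
an HTML content type, or a redirect status with a "Location" header, if the
server is set up to serve its index document for "/". Anything else points at
the server configuration rather than at the robot.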

ht://Dig will index just *anything* that is returned by the HTTP server
on request and that is recognized as being indexable (check the "bad_urls"
configuration option for that).

>I indexed a big list of sites, and most come up...but so far of the ones
>I've checked, only the ones that deviate from "index.html" are not showing
>up when the URL I have for them is http://foo.com/

Check those sites with a web browser and have a look at the code. Lynx works
best for that: if you can't get to a page with Lynx, any robot will fail, too.
You can even request the index document by telnet'ing to the HTTP port of
that site. ht://Dig will by no means index a site that uses JavaScript or any
other funky stuff to redirect pages. It might also refuse those "refresh" meta
tags in some pages. HTTP "Location" headers should be no problem, though.
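
If there are too many entry pages to eyeball, a small script can flag the
suspicious ones. The following is only a heuristic sketch in Python, not
anything ht://Dig itself does: RedirectSniffer and sniff_redirects are
hypothetical names, and the JavaScript pattern catches only the most common
"location = ..." and "location.replace(...)" idioms.

```python
import re
from html.parser import HTMLParser

# Heuristic: assignments to (window.|document.)location[.href] or
# calls to location.replace(...). Comparisons with "==" are ignored.
JS_REDIRECT = re.compile(
    r"\b(?:window\.|document\.)?location(?:\.href)?\s*=(?!=)"
    r"|location\.replace\s*\("
)


class RedirectSniffer(HTMLParser):
    """Record a <meta http-equiv="refresh"> tag and script-based redirects."""

    def __init__(self):
        super().__init__()
        self.meta_refresh = None   # content of the refresh tag, if any
        self.js_redirect = False   # True if a script looks like a redirect
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            self.meta_refresh = attrs.get("content")
        elif tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script and JS_REDIRECT.search(data):
            self.js_redirect = True


def sniff_redirects(html: str):
    """Return (meta_refresh_content_or_None, js_redirect_found)."""
    sniffer = RedirectSniffer()
    sniffer.feed(html)
    sniffer.close()
    return sniffer.meta_refresh, sniffer.js_redirect
```

A page for which sniff_redirects() reports either kind of redirect is one
whose real content a robot will probably never reach from the bare URL, so
it is worth adding the actual entry page to the start URLs by hand.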

Remember also to have a look at "Robots" tags and the "/robots.txt" file.
If any of those forbids digging, you'll go hungry ,-)
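
For the "/robots.txt" part, a quick way to test a rule set offline is
Python's standard urllib.robotparser module. This sketch covers only
robots.txt, not "Robots" meta tags; allowed_by_robots is a made-up helper
name, and the user-agent string "htdig" is an assumption, so use whatever
robot name your digger is configured to announce.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Invented sample rules for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("http://foo.com/", "http://foo.com/private/page.html"):
        print(url, allowed_by_robots(SAMPLE_ROBOTS_TXT, "htdig", url))
```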

Btw, what do you refer to as "the more standard index filename"?

>Simply love htdig! But need help to get those last few sites.
>
>Thanks!
>
>Gabriel

hth,
  Torsten

--
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14                            Tel: +49-4101-403605
D-25474 Ellerbek                            Fax: +49-4101-403606
E-Mail: info@inwise.de            Internet: http://www.inwise.de



