Re: [htdig] accessing sites whose entry pages are not index.html

Gilles Detillieux
Thu, 29 Apr 1999 17:15:14 -0500 (CDT)

According to Gabriel Fenteany:
> You have intuited my real question. I wanted to know more about these
> attributes. Two questions: if I DO NOT have a remove_default_doc defined
> in htdig.conf (so the default is index.html, right?), will htdig be able to
> dig a site where the index file of this site is named
> "index.htm" or "default.htm" AND is accessible by a browser (so they did use
> DirectoryIndex for Apache or the equivalent for other Web servers)?

Yes, it will still be able to dig the site. The only problem is you
may wind up with some essentially duplicate URLs. For instance, if you
dig a directory URL and somewhere on that site htdig follows a link to
the default.htm file in that same directory, it will add it to the index,
not realising that this is the same document as the directory URL itself,
so searches for a word in that document will turn up both URLs. With
default.htm added to the remove_default_doc list, htdig will strip off
that name and realise that it's already indexed that directory.
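
To make the duplicate problem concrete, here is a minimal Python sketch
of the kind of normalisation remove_default_doc performs. It is only an
illustration, not htdig's actual code: the function name
strip_default_doc and the www.example.com URLs are made up for the
example.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_default_doc(url, default_docs=("index.html",)):
    """Drop the last path component if it is a default-document name."""
    parts = urlsplit(url)
    head, sep, tail = parts.path.rpartition("/")
    if tail in default_docs:
        # Keep the trailing slash so the URL still names the directory.
        parts = parts._replace(path=head + sep)
    return urlunsplit(parts)

# Hypothetical URLs that a dig might come across for one and the same page.
urls = ["http://www.example.com/", "http://www.example.com/default.htm"]

# With only the default name, both URLs survive as separate entries.
without_fix = {strip_default_doc(u) for u in urls}

# With default.htm in the list, they collapse into one entry.
with_fix = {strip_default_doc(u, ("index.html", "default.htm")) for u in urls}
```

Here without_fix holds two distinct URLs while with_fix holds one, which
is the difference you would see in the search results.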

Ideally, remove_default_doc should be set to the same list as the
DirectoryIndex for all the servers you dig. Unfortunately, you may not
have that sort of control over the sites you dig, so you may have to
settle for the least common set of names allowed as directory indexes
on all servers, and put up with a few duplicate entries in your database.
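
If you read "least common set" as the names that every server accepts
as a directory index, it amounts to a set intersection. A small Python
sketch, assuming you know each server's list (the host names and lists
below are invented for illustration):

```python
# Hypothetical DirectoryIndex lists, one per server you dig.
directory_indexes = {
    "www.example.com": ["index.html", "index.htm"],
    "docs.example.org": ["index.html", "default.htm"],
    "intranet.example.net": ["index.html", "index.htm", "index.shtml"],
}

# Only names that are a directory index on every server are safe to strip.
common = set.intersection(*(set(names) for names in directory_indexes.values()))

# Format the result as an htdig.conf attribute line.
conf_line = "remove_default_doc: " + " ".join(sorted(common))
```

With these made-up lists only index.html is common to all three servers,
so index.htm and default.htm URLs would still show up as duplicates.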

What you want to avoid is having names in that list that may be used on
some servers as something other than a directory index. For instance, if
a server has a default.htm file on it, but this is different from
its index.html in the same directory, and only index.html is the directory
index, then you don't want to put default.htm in remove_default_doc.

> I am wondering this because I indexed earlier, and afterwards found that a
> site in the starting_urls, with an entry point equivalent to a bare
> directory URL, did not appear in the database for searches. I
> checked out the site, by typing the "URL" without explicit name of index
> HTML file and I got there. Then I discovered that its index file is called
> "index.htm". Could htdig have missed indexing this site because the path
> was given without a file name and it didn't know to look for index.htm instead
> of index.html? Clearly, the browser found it so their server is configured
> right.

Well, htdig will get whatever the server gives it, so it should be the
same thing that your browser gets, unless you use local_urls to bypass
the server. However, htdig only follows <a href=...> links to other
documents, so if you point htdig at a site's top-level URL, and in recursively
following all the hrefs on that site, it doesn't come across an href to
"/stuff/" or to the equivalent full URL, then it won't index the "stuff"
subdirectory. In fact, it won't even know it's there!
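
The same point as a toy Python sketch: a crawler that only follows hrefs
can only reach what is linked from where it starts. This is not how
htdig is implemented, and the little link table is invented for the
example.

```python
from collections import deque

def reachable(start, links):
    """Return every page reachable from start by following hrefs."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical site: each page maps to the hrefs found on it.
# Nothing links to "/stuff/", even though the server would serve it.
links = {
    "/": ["/about.html", "/docs/"],
    "/docs/": ["/docs/intro.html", "/"],
    "/stuff/": ["/stuff/page.html"],
}

found = reachable("/", links)
```

In this sketch found never contains "/stuff/" or "/stuff/page.html",
because no page reachable from "/" has an href pointing there.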

> Why don't I just add index.htm to that site in the starting_url of the
> htdig.conf file? Because I have a lot of other sites too that don't have an
> explicit URL, and I don't want to look it up for all of them.
> If I don't insert a remove_default_doc attribute and the default is used,
> will the sites described above be properly accessed and indexed by htdig?

They should be, as long as the hypertext links are intact. Note also that
htdig does not follow JavaScript links, only HTML links.

> Lesser question: does local_default_doc take multiple index file names? I
> know it didn't as of 0.1 revisions ago, from what I was reading in the
> archives.

No, still no change there. It only takes a single name, so use the most
commonly used name for a directory index file, among the directories you
dig via local_urls. For any other allowed directory index file name, htdig
won't find it, so it'll fall back to the HTTP server, which in most cases
works just fine. You see why changing this hasn't been a big priority.
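
Roughly, the lookup behaves like the following Python sketch. It is an
approximation of the behaviour described above, not htdig's code; the
function name local_file_for and the doc_root argument (standing in for
whatever directory local_urls maps the site to) are assumptions made for
the example.

```python
import os
import tempfile

def local_file_for(url_path, doc_root, local_default_doc="index.html"):
    """Return a local file for url_path, or None to fall back to HTTP."""
    relative = url_path.lstrip("/")
    if url_path.endswith("/") or relative == "":
        # A directory URL: only the single configured name is tried.
        relative = os.path.join(relative, local_default_doc)
    candidate = os.path.join(doc_root, relative)
    return candidate if os.path.isfile(candidate) else None

# Throwaway document root with two directories using different index names.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a"))
    os.makedirs(os.path.join(root, "b"))
    open(os.path.join(root, "a", "index.html"), "w").close()
    open(os.path.join(root, "b", "index.htm"), "w").close()
    hit = local_file_for("/a/", root)   # found locally
    miss = local_file_for("/b/", root)  # None, so ask the HTTP server
```

The directory that uses index.htm is simply not found on disk, and the
request goes to the server, which knows its own DirectoryIndex list.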

> If I don't set either remove_default_doc or local_default_doc in the
> htdig.conf file, will htdig index all the sites above for which I don't have the
> full path explicit URL of the entry page and which don't use "index.html" as
> their entry page (but are properly configured since the Web browsers access
> them)????

It should work, subject to the notes I've made above.

> Thanks a bunch and we all indeed appreciate all of the developers' great
> work on htdig. Long live ht://Dig and long live GNU!

Amen! :-)

Gilles R. Detillieux
Spinal Cord Research Centre
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

This archive was generated by hypermail 2.0b3 on Thu Apr 29 1999 - 15:24:23 PDT