Re: [htdig] Indexing only HTML (again)

Gilles Detillieux
Tue, 16 Mar 1999 10:59:24 -0600 (CST)

According to Geoff Hutchison:
> >>I want to index only html files. Having limit_url_to set to html
> >>and limit normalised set to ${start_url} almost worked.
> >>The only problem was some servers (all?) interpret a URL ending
> >>in a / as /index.html and many authors use this to skip the
> >>index.html. Any ideas about how to get around this?
> I will make the assumption that your server only serves HTML files when a
> URL ends in /?
> How about:
> limit_url_to: html /
> limit_normalized: ${start_url}
> I haven't tried it, but I suspect it should work...

I don't know about that. I think it will match a "/" anywhere in the URL,
so it will accept just about anything.
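
To illustrate the concern, here is a minimal Python sketch. It assumes the
limit list is applied as a plain substring test against the whole URL, which
is my reading of the behaviour and not something checked against the htdig
source. The function name and the example.com URLs are made up for the
illustration.

```python
def matches_limit(url, patterns):
    """Return True if any pattern occurs anywhere in the URL.

    Assumption: each pattern is treated as a plain substring of the URL.
    """
    return any(p in url for p in patterns)


# Hypothetical URLs, chosen only to show the effect of adding "/" to the list.
urls = [
    "http://www.example.com/docs/",
    "http://www.example.com/docs/index.html",
    "http://www.example.com/files/report.pdf",
    "http://www.example.com/notes.txt",
]

# With "html" alone, the directory URL is rejected along with the .pdf and .txt.
html_only = [u for u in urls if matches_limit(u, ["html"])]

# With "/" added, every URL matches, because "http://" already contains a "/".
html_or_slash = [u for u in urls if matches_limit(u, ["html", "/"])]

print(html_only)
print(html_or_slash)
```

Under that assumption, adding "/" to the list stops it from filtering
anything at all.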

I think a better approach, rather than explicitly setting what you want
to include, would be to explicitly set what you want to exclude. After
all, htdig can only handle a few file types on its own. If you don't
want to index .pdf files, add .pdf to bad_extensions, and if you don't
want to index plain text files, add .txt & .asc to bad_extensions. I
guess that would still leave in text files that have no extension in the
name, though, so if you have some of those, this approach may not be
complete enough. Mind you, for those, you could override the built-in
plain text parser with an external parser that does nothing. So, why
don't you give this a try?

bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif .jpg .jpeg \
        .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .pdf .txt .asc .ps

external_parsers: text/plain /bin/true

This would cause all files with bad extensions to be skipped over entirely,
and plain text files with no extension would be fetched but not indexed.
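
As a rough sketch of how the exclusion list sorts URLs, here is the same idea
in Python. It assumes the extensions are compared against the end of the URL
path and that case is ignored; both are assumptions for the illustration, not
a statement of how htdig implements bad_extensions. The function name and the
example.com URLs are hypothetical.

```python
from urllib.parse import urlsplit

# Same list as in the bad_extensions setting above.
BAD_EXTENSIONS = (
    ".wav", ".gz", ".z", ".sit", ".au", ".zip", ".tar", ".hqx", ".exe",
    ".com", ".gif", ".jpg", ".jpeg", ".aiff", ".class", ".map", ".ram",
    ".tgz", ".bin", ".rpm", ".mpg", ".mov", ".avi", ".pdf", ".txt",
    ".asc", ".ps",
)


def is_skipped(url, bad_extensions=BAD_EXTENSIONS):
    """Return True if the URL path ends with one of the bad extensions.

    Assumptions: only the path is checked (not the host name), and the
    comparison ignores case.
    """
    path = urlsplit(url).path.lower()
    return path.endswith(tuple(bad_extensions))


# Hypothetical URLs for illustration.
for url in (
    "http://www.example.com/docs/",            # directory URL: kept
    "http://www.example.com/docs/index.html",  # HTML: kept
    "http://www.example.com/files/report.pdf", # skipped
    "http://www.example.com/README",           # no extension: kept
):
    print(url, "skipped" if is_skipped(url) else "kept")
```

The extensionless README is the case the extension list cannot catch, which
is where the do-nothing external parser for text/plain comes in.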

Gilles R. Detillieux
Spinal Cord Research Centre
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

This archive was generated by hypermail 2.0b3 on Wed Mar 17 1999 - 10:05:13 PST