Subject: [htdig3-dev] Fixes for valid_extensions
From: Warren Jones (wjones@tc.fluke.com)
Date: Tue Jan 11 2000 - 16:04:33 PST
I was very happy to find that the "valid_extensions" option has
been added in version 3.1.4 -- something like this is essential
given the rather chaotic nature of the web server that I have
to index. But I found that a couple of changes were necessary
to make valid_extensions work the way I wanted it to.
If "valid_extensions" is defined, I'd like to retrieve URLs
without extensions *if_and_only_if* they represent a directory.
However, I found that all URLs without extensions are rejected
if the URL contains a fully qualified domain name, e.g.:
Retriever::IsValidURL() rejects this URL because it thinks
the extension is:
.com/bar/
The patch for Retriever.cc (included below) fixes this.
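
As a standalone illustration of that check (a minimal sketch, not the ht://Dig code itself): the helper name final_extension and the example.com URLs below are made up for this example. The idea is the same as in the patch: find the last dot, and ignore it if a slash follows it.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper, not part of ht://Dig: returns the extension of the
// final path component (including the dot), or "" if there is none.
std::string final_extension(const char *url)
{
    const char *ext = std::strrchr(url, '.');
    // A '/' after the last dot means the dot belongs to the host name
    // or to an earlier directory, so it is not a file extension.
    if (ext && std::strchr(ext, '/'))
        ext = nullptr;
    return ext ? std::string(ext) : std::string();
}
```

With this check, a URL such as http://www.example.com/bar/ yields no extension, whereas a bare strrchr() on the same string would return ".com/bar/".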
To ensure that a URL without an extension will be retrieved
only if it's a directory, I modified URL::normalize() so that
a slash is appended to any URL that doesn't have an extension.
This guarantees that retrieval will fail if the URL is not
a directory. This works for me, but I'm not sure that it's
the best solution -- comments would be appreciated.
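
The trailing-slash rule can be sketched on its own, outside of URL::normalize(). This is only an illustration under assumptions: slash_if_no_extension is a hypothetical name, it works on a std::string rather than the ht://Dig String class, and it uses string indices where the patch compares pointers.

```cpp
#include <string>

// Hypothetical helper mirroring the rule described above: if the last
// path component has no dot, treat it as a directory and append '/'.
std::string slash_if_no_extension(std::string path)
{
    std::string::size_type slash = path.rfind('/');
    // Already ends in a slash: nothing to do.
    if (slash != std::string::npos && slash + 1 == path.size())
        return path;
    std::string::size_type dot = path.rfind('.');
    // The dot only counts if it comes after the last slash.
    bool dot_in_last_component =
        dot != std::string::npos &&
        (slash == std::string::npos || dot > slash);
    if (!dot_in_last_component)
        path += '/';
    return path;
}
```

Under this sketch "/docs" becomes "/docs/", while "/docs/index.html" and "/docs/" are returned unchanged. A dot in an earlier directory name, as in "/v1.2/docs", does not count as an extension.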
-- Warren Jones
   Fluke Corporation

---------------------------- snip snip ----------------------------
Index: Retriever.cc
===================================================================
RCS file: /home/wjones/src/CVS.repo/htdig/htdig/Retriever.cc,v
retrieving revision 1.1.1.5
diff -c -r1.1.1.5 Retriever.cc
*** Retriever.cc 1999/12/15 22:06:09 1.1.1.5
--- Retriever.cc 2000/01/11 00:28:29
***************
*** 702,707 ****
--- 702,709 ----
      //
      char *ext = strrchr(url, '.');
      String lowerext;
+     if ( ext && strchr(ext,'/') )     // Ignore a dot if it's not in the
+         ext = NULL;                   // final component of the path.
      if (ext)
      {
          lowerext = ext;

Index: URL.cc
===================================================================
RCS file: /home/wjones/src/CVS.repo/htdig/htlib/URL.cc,v
retrieving revision 1.1.1.5
diff -c -r1.1.1.5 URL.cc
*** URL.cc 1999/12/15 22:06:35 1.1.1.5
--- URL.cc 2000/01/11 23:09:26
***************
*** 469,474 ****
--- 469,490 ----
  
      removeIndex(_path);
  
+     if ( *config["valid_extensions"] != '\0' )
+     {
+         // If we're only accepting valid extensions, then append
+         // a trailing slash to any URL without an extension.
+         // This insures that the only URL's without extensions
+         // we retrieve will be directories.
+ 
+         char *slash = strrchr( _path, '/' );
+         if ( ! slash || slash[1] != '\0' )
+         {
+             char *dot = strrchr( _path, '.' );
+             if ( dot <= slash )
+                 _path << "/";
+         }
+     }
+ 
      //
      // Convert a hostname to an IP address
      //