Gilles Detillieux (email@example.com)
Wed, 17 Mar 1999 13:00:45 -0600 (CST)
According to Geoff Hutchison:
> On Wed, 17 Mar 1999, Frank Richter wrote:
> > I don't have "avi" in bad_extension, so this document is fetched and
> > interpreted as text? If so this should be changed...
> At the moment, ht://Dig isn't very smart about MIME. So it takes text as
> default and will attempt to index anything not explicitly excluded. More
> intelligent MIME analysis is on the slate for 3.2 (and needed for other
Let's make it a little smarter, shall we? First of all, a very bad patch
was applied in January to 3.1.0 (post-b4), which effectively disabled
Document::readHeader()'s checking of the content-type header. This was
in response to PR#91 in the bug tracking database.
> Date: Tue, 29 Dec 1998 07:49:09 -0800
> From: firstname.lastname@example.org
> To: email@example.com
> Subject: redirects with content-type
> Full_Name: Robert Barta
> Version: htdig-3.1.0b4
> OS: Solaris 5.6
> Submission from: (NULL) (188.8.131.52)
> When htdig/Document.cc encounters a redirect together with
> some Content-Type, the content-type will have preference:
> # telnet cgi1.bellacoola.com 80
> Trying 184.108.40.206...
> Connected to www.bellacoola.com.
> Escape character is '^]'.
> GET /adios.cgi/278?http%3a%2F%2Fwww%2Eaustria%2Eeu%2Enet%2F HTTP/1.0
> HTTP/1.1 302 Moved
> Date: Tue, 29 Dec 1998 15:04:56 GMT
> Server: Apache/1.3.3 (Unix) mod_perl/1.16
> Location: http%3a%2F%2Fwww%2Eaustria%2Eeu%2Enet%2F
> Connection: close
> Content-Type: application/x-httpd-cgi
> Connection closed by foreign host.
> This does not result in a redirect, but in a "not HTML" message.
> I applied the following patch:
> # diff Document.cc Document.cc.orig
> < if (returnStatus == Header_not_found &&
> < mystrncasecmp("text/", token, 5) != 0 &&
> > if (mystrncasecmp("text/", token, 5) != 0 &&
The problem with this is that, by the time the Content-type header is
read, readHeader has already found the status line, so returnStatus is
usually Header_ok at this point. The patch below should fix this,
without interfering with redirects that carry an (incorrect) content-type.
I've also added another test, to prevent it from rejecting types that
are handled by external parsers, but not built-in ones. Please give
this a whirl...
--- htdig/Document.cc.hdrbug Tue Feb 16 23:03:52 1999
+++ htdig/Document.cc Wed Mar 17 12:26:05 1999
@@ -525,7 +525,9 @@ Document::readHeader(Connection &c)
             strtok(line, " \t");
             char *token = strtok(0, "\n\t");
-            if (returnStatus == Header_not_found &&
+            if ((returnStatus == Header_not_found ||
+                 returnStatus == Header_ok) &&
+                !ExternalParser::canParse(token) &&
                 mystrncasecmp("text/", token, 5) != 0 &&
                 mystrncasecmp("application/postscript", token, 22) != 0 &&
                 mystrncasecmp("application/msword", token, 18) != 0 &&
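
For anyone who wants to check the logic without building htdig, here is a
rough standalone sketch of the patched test. It is not the code from
Document.cc: the enum, applyContentType(), externalCanParse() and the set
of external types are stand-ins made up for illustration, and only the
three built-in types visible in the hunk above are checked (the real
condition continues past that point). It also assumes the status line of
a 302 response has already set the status to Header_redirect.

```cpp
#include <cctype>
#include <set>
#include <string>

// Stand-in for the header status values used by readHeader (assumed names).
enum HeaderStatus { Header_ok, Header_not_found, Header_redirect, Header_not_text };

// Case-insensitive "does s start with prefix?" test, in the spirit of
// the mystrncasecmp(prefix, s, len) != 0 checks in the patch.
static bool hasPrefixNoCase(const char *prefix, const char *s)
{
    for (; *prefix; ++prefix, ++s) {
        if (std::tolower(static_cast<unsigned char>(*prefix)) !=
            std::tolower(static_cast<unsigned char>(*s)))
            return false;   // also covers s being shorter than prefix
    }
    return true;
}

// Hypothetical stand-in for ExternalParser::canParse(): here it is just an
// exact-match lookup in a set of configured content types.
static bool externalCanParse(const std::set<std::string> &externalTypes,
                             const char *type)
{
    return externalTypes.count(type) != 0;
}

// Returns the status after seeing a Content-Type header of the given type.
// Only a status of Header_not_found or Header_ok can be downgraded to
// Header_not_text, so a redirect found on the status line is left alone.
HeaderStatus applyContentType(HeaderStatus current, const char *type,
                              const std::set<std::string> &externalTypes)
{
    if ((current == Header_not_found || current == Header_ok) &&
        !externalCanParse(externalTypes, type) &&
        !hasPrefixNoCase("text/", type) &&
        !hasPrefixNoCase("application/postscript", type) &&
        !hasPrefixNoCase("application/msword", type))
        return Header_not_text;
    return current;
}
```

With this, a 200 response with a type such as video/x-msvideo comes back
as Header_not_text, a type listed for an external parser is let through,
and a redirect with application/x-httpd-cgi stays a redirect.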
I didn't remove application/postscript or application/msword from the
built-in tests, though they probably should be taken out, as they can
only be handled by external parsers. The test for text/ allows all
text/* types (other than text/plain) to be parsed by Plaintext.cc,
but nothing else should be treated as text by default. By keeping
msword in the list above, it will allow Plaintext.cc to give Word
documents a shot by default. It doesn't seem to do a good job of it,
though. ;-) The built-in PostScript parser is currently disabled,
and I think it ought to be taken out.
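
If the two application/ types were taken out as suggested, the built-in
test would come down to the text/ prefix alone. One way to picture that,
purely as a sketch: keep the accepted prefixes in a small table, so that
dropping or adding a type is a one-line change. The names
kBuiltinPrefixes, startsWithNoCase() and builtinAccepts() are invented
here and do not exist in htdig.

```cpp
#include <cctype>
#include <cstddef>

// Hypothetical table of content-type prefixes the built-in parsers accept.
// With postscript and msword removed, only "text/" remains.
static const char *const kBuiltinPrefixes[] = { "text/" };

// Case-insensitive prefix comparison.
static bool startsWithNoCase(const char *prefix, const char *s)
{
    for (; *prefix; ++prefix, ++s) {
        if (std::tolower(static_cast<unsigned char>(*prefix)) !=
            std::tolower(static_cast<unsigned char>(*s)))
            return false;
    }
    return true;
}

// True if some built-in prefix matches the given content type.
bool builtinAccepts(const char *type)
{
    const std::size_t n = sizeof(kBuiltinPrefixes) / sizeof(kBuiltinPrefixes[0]);
    for (std::size_t i = 0; i < n; ++i) {
        if (startsWithNoCase(kBuiltinPrefixes[i], type))
            return true;
    }
    return false;
}
```

Under that table, Word and PostScript documents would be indexed only
when an external parser is configured for them.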
--
Gilles R. Detillieux              E-mail: <firstname.lastname@example.org>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930