Re: [htdig] Indexing only HTML (again)


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 17 Mar 1999 13:00:45 -0600 (CST)


According to Geoff Hutchison:
> On Wed, 17 Mar 1999, Frank Richter wrote:
>
> > I don't have "avi" in bad_extension, so this document is fetched and
> > interpreted as text? If so this should be changed...
>
> At the moment, ht://Dig isn't very smart about MIME. So it takes text as
> default and will attempt to index anything not explicitly excluded. More
> intelligent MIME analysis is on the slate for 3.2 (and needed for other
> parts).

Let's make it a little smarter, shall we? First of all, a very bad patch
was applied in January to 3.1.0 (post-b4), which effectively disabled
Document::readHeader()'s checking of the content-type header. This was
in response to PR#91 in the bug tracking database.

> Date: Tue, 29 Dec 1998 07:49:09 -0800
> From: rho@austria.eu.net
> To: htdig3-bugs@htdig.org
> Subject: redirects with content-type
>
> Full_Name: Robert Barta
> Version: htdig-3.1.0b4
> OS: Solaris 5.6
> Submission from: (NULL) (193.154.142.3)
>
> When htdig/Document.cc encounters a redirect together with
> some Content-Type, the content-type will have preference:
>
> # telnet cgi1.bellacoola.com 80
> Trying 209.249.48.65...
> Connected to www.bellacoola.com.
> Escape character is '^]'.
> GET /adios.cgi/278?http%3a%2F%2Fwww%2Eaustria%2Eeu%2Enet%2F HTTP/1.0
>
> HTTP/1.1 302 Moved
> Date: Tue, 29 Dec 1998 15:04:56 GMT
> Server: Apache/1.3.3 (Unix) mod_perl/1.16
> Location: http%3a%2F%2Fwww%2Eaustria%2Eeu%2Enet%2F
> Connection: close
> Content-Type: application/x-httpd-cgi
>
> Connection closed by foreign host.
>
> This does not result in a redirect, but in a "not HTML" message.
> I applied the following patch:
>
> # diff Document.cc Document.cc.orig
> 605,606c605
> < if (returnStatus == Header_not_found &&
> < mystrncasecmp("text/", token, 5) != 0 &&
> ---
> > if (mystrncasecmp("text/", token, 5) != 0 &&

The problem with this is that by the time the Content-type header is
read, readHeader has already found the status line, so returnStatus is
usually Header_ok at this point and the content-type test never fires.
The patch below should fix this, without interfering with redirects
that carry an (incorrect) content-type. I've also added another test,
to prevent it from rejecting types that are handled by external
parsers, but not built-in ones. Please give this a whirl...

--- htdig/Document.cc.hdrbug Tue Feb 16 23:03:52 1999
+++ htdig/Document.cc Wed Mar 17 12:26:05 1999
@@ -525,7 +525,9 @@ Document::readHeader(Connection &c)
                 strtok(line, " \t");
                 char *token = strtok(0, "\n\t");
                                 
-                if (returnStatus == Header_not_found &&
+                if ((returnStatus == Header_not_found ||
+                     returnStatus == Header_ok) &&
+                    !ExternalParser::canParse(token) &&
                     mystrncasecmp("text/", token, 5) != 0 &&
                     mystrncasecmp("application/postscript", token, 22) != 0 &&
                     mystrncasecmp("application/msword", token, 18) != 0 &&
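
For anyone who wants to see the ordering problem in isolation, here is a
minimal standalone sketch, not ht://Dig code. Header_not_found and
Header_ok are the names used in the patch; Header_redirect,
externalCanParse(), and the helper functions are hypothetical names made
up for this example. The hunk above ends mid-condition, so any further
tests in the real if statement are not modeled here.

```cpp
#include <cctype>

// Header_not_found and Header_ok appear in the patch; Header_redirect is
// an assumed name for whatever status a 302 response produces.
enum HeaderStatus { Header_not_found, Header_ok, Header_redirect };

// Hypothetical stand-in for ExternalParser::canParse(); this sketch
// assumes no external parsers are configured.
static bool externalCanParse(const char *) { return false; }

// Case-insensitive prefix match, playing the role of mystrncasecmp() == 0.
static bool hasPrefixNoCase(const char *s, const char *prefix)
{
    for (; *prefix; ++s, ++prefix)
        if (std::tolower(static_cast<unsigned char>(*s)) !=
            std::tolower(static_cast<unsigned char>(*prefix)))
            return false;
    return true;
}

// The three type tests visible in the hunk.
static bool builtinType(const char *type)
{
    return hasPrefixNoCase(type, "text/") ||
           hasPrefixNoCase(type, "application/postscript") ||
           hasPrefixNoCase(type, "application/msword");
}

// January version: only fires while no status line has been seen yet.
static bool rejectedByJanuaryTest(HeaderStatus s, const char *type)
{
    return s == Header_not_found && !builtinType(type);
}

// Proposed version: also fires after a normal status line, but leaves
// redirects alone and defers to external parsers.
static bool rejectedByProposedTest(HeaderStatus s, const char *type)
{
    return (s == Header_not_found || s == Header_ok) &&
           !externalCanParse(type) && !builtinType(type);
}
```

With a status of Header_ok and a type of "video/x-msvideo", the January
test does not reject the document (so it would be fetched and treated as
text), while the proposed test does. With the assumed redirect status and
"application/x-httpd-cgi", neither test rejects, so the redirect can
still be followed.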

I didn't remove application/postscript or application/msword from the
built-in tests, though they probably should be taken out, as they can
only be handled by external parsers. The test for text/ allows all
text/* types (other than text/plain) to be parsed by Plaintext.cc,
but nothing else should be treated as text by default. Keeping
msword in the list above lets Plaintext.cc give Word documents a
shot by default. It doesn't seem to do a good job of it,
though. ;-) The built-in Postscript parser is currently disabled,
and I think it ought to be taken out.
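
As a rough illustration of that cleanup (again a standalone sketch, not
ht://Dig code), the acceptance rule would reduce to "an external parser
claims the type, or it is text/*". The function names and the set of
external types below are assumptions for the example; how
ExternalParser::canParse() actually matches types is not modeled.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

// Lower-case the media type and drop parameters such as "; charset=...".
static std::string normalizeType(const std::string &raw)
{
    std::string type = raw.substr(0, raw.find_first_of("; \t"));
    std::transform(type.begin(), type.end(), type.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    return type;
}

// Hypothetical rule once postscript and msword leave the built-in list:
// accept a type only if an external parser claims it or it is text/*.
static bool acceptedAfterCleanup(const std::string &rawType,
                                 const std::set<std::string> &externalTypes)
{
    const std::string type = normalizeType(rawType);
    if (externalTypes.count(type) != 0)
        return true;                          // external parser handles it
    return type.compare(0, 5, "text/") == 0;  // built-in parsers: text/* only
}
```

Under this rule, "application/msword" is accepted only when it appears in
the set of externally handled types, so Plaintext.cc would no longer get
a shot at Word documents by default.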

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



This archive was generated by hypermail 2.0b3 on Fri Mar 19 1999 - 17:32:54 PST