Re: [htdig] Problem with content-type text/html; charset=SOMETHING


From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed May 24 2000 - 11:51:58 PDT


According to Gordon Harty:
> I have some pages that I want to index that have something like:
>
> <meta http-equiv="Content-Type" content= "text/html; charset=windows-1252">
>
> or
>
> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
>
> In both cases these files do not get indexed into the search database.
> I would like to have all pages that have a content type that begins
> with "text/html" to be indexed by the internal parser. Is there a way
> to do this?
>
> I'm running htdig 3.1.5.

There's nothing in htdig's HTML parser that would cause it to have
problems with meta tags like those above. It will simply ignore them.
The problem must be that your HTTP server doesn't report the correct
content-type for these pages to htdig. Most servers assign content-type
based on file name suffixes, as defined in a mime.types file, so you
might have to add definitions in there for the suffixes you use for your
files, or use the AddType directive in your .htaccess file if your server
is Apache. If your files have a suffix of .html or .htm, then the server
should tag them as text/html by default.
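
To check what the server actually reports, you can look at the
Content-Type header it sends for one of the affected pages. Below is a
minimal Python sketch for that, not part of htdig. The function name
reported_content_type is made up for illustration, and the URLs are
whatever pages you pass on the command line. If the header turns out to
be something other than text/html, an Apache line of the form
"AddType text/html .yoursuffix" (with your real suffix in place of the
hypothetical .yoursuffix) is the kind of fix described above.

```python
import sys
import urllib.request


def reported_content_type(url):
    """Return the Content-Type header the server sends for url.

    Uses a HEAD request, so only the headers are transferred,
    not the body of the page.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        # An empty string means the server sent no Content-Type at all.
        return response.headers.get("Content-Type", "")


if __name__ == "__main__":
    # Usage (script name is hypothetical):
    #   python check_type.py http://your.server/some/page
    for url in sys.argv[1:]:
        print(url, "->", reported_content_type(url))
```

Run it against one page that gets indexed and one that doesn't, and
compare the two header values.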

If htdig doesn't get the right content-type from the HTTP server, it won't
even download the content of the file. So even if htdig did parse meta
tags like those above, it wouldn't matter: it would never see them when
the server mislabels the file because of a wrong suffix.
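
The ordering described above (look at the header first, fetch the body
only if the type is acceptable) can be illustrated with a rough Python
sketch. This is not htdig's code. The names media_type and
should_download and the accepted list are invented for the example, and
ignoring parameters such as charset is an assumption of the sketch
rather than a statement about how htdig 3.1.5 compares types.

```python
from email.message import Message


def media_type(header_value):
    """Return the bare type/subtype from a Content-Type header value.

    Parameters such as "; charset=iso-8859-1" are dropped and the
    result is lowercased.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type()


def should_download(header_value, accepted=("text/html",)):
    """Decide from the HTTP header alone whether to fetch the body.

    Meta tags inside the page play no part here: at this point the
    body has not been retrieved yet.
    """
    return media_type(header_value) in accepted


if __name__ == "__main__":
    for value in ("text/html; charset=windows-1252",
                  "application/octet-stream"):
        print(value, "->", should_download(value))
```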

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



