Re: [htdig] Problem with content-type text/html; charset=SOMETHING


Subject: Re: [htdig] Problem with content-type text/html; charset=SOMETHING
From: Gordon Harty (gordonh@TAGnet.org)
Date: Wed May 24 2000 - 15:44:44 PDT


I am using Apache to serve both of the examples that I listed. One
of them is named "index.html". The other is "index.htm". By default,
Apache maps the html and htm suffixes to text/html. I'm also indexing
thousands of other files on my site that have the .html and .htm
extensions, so I'm not sure what could be unique about these pages.

The content-type was the only thing I could think of that would be
different. Both of these pages were being indexed before. The meta data
was added to both pages, and around the same time (I can't confirm it
was exactly the same time, but it was within the same week or so) these
two sites stopped being indexed.

Would Apache be picking up this meta information and feeding it to
htdig?
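
One way to check this would be to look at the Content-Type header the
server actually sends for each page, since according to the reply quoted
below it is that header, not the meta tag, that htdig acts on. The
following is only a rough Python sketch of such a check. The function
names and the example URL are made up for illustration, and the "is it
text/html" test is my own simplification, not htdig's actual logic.

```python
from urllib.request import Request, urlopen


def media_type(header_value):
    """Return the bare media type from a Content-Type header value.

    Any parameters after the first ';' (such as charset=...) are dropped,
    and the result is lowercased.
    """
    return header_value.split(";", 1)[0].strip().lower()


def is_html(header_value):
    """True if the header value names text/html, with or without a charset."""
    return media_type(header_value) == "text/html"


def reported_content_type(url):
    """Ask the server (HEAD request) which Content-Type it sends for url.

    Returns None if the server sends no Content-Type header at all.
    """
    with urlopen(Request(url, method="HEAD")) as response:
        return response.headers.get("Content-Type")


if __name__ == "__main__":
    # Hypothetical URL; substitute one of the pages that is not indexed.
    header = reported_content_type("http://www.example.org/index.htm")
    print("server reports:", header)
    if header is not None:
        print("looks like text/html:", is_html(header))
```

If the header comes back as plain "text/html" for the two problem pages,
then the server is not picking up the charset from the meta tag, and the
cause is probably somewhere else.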

Gordon

On Wed, May 24, 2000 at 01:51:58PM -0500, Gilles Detillieux wrote:
> According to Gordon Harty:
> > I have some pages that I want to index that have something like:
> >
> > <meta http-equiv="Content-Type" content= "text/html; charset=windows-1252">
> >
> > or
> >
> > <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
> >
> > In both cases these files do not get indexed into the search database.
> > I would like to have all pages that have a content type that begins
> > with "text/html" to be indexed by the internal parser. Is there a way
> > to do this?
> >
> > I'm running htdig 3.1.5.
>
> There's nothing in htdig's HTML parser that would cause it to have
> problems with meta tags like those above. It will simply ignore them.
> The problem must be that your HTTP server doesn't report the correct
> content-type for these pages to htdig. Most servers assign content-type
> based on file name suffixes, as defined in a mime.types file, so you
> might have to add definitions in there for the suffixes you use for your
> files, or use the AddType directive in your .htaccess file if your server
> is Apache. If your files have a suffix of .html or .htm, then the server
> should tag them as text/html by default.
>
> If htdig doesn't get the right content-type from the HTTP server, it won't
> even download the content of the file, so it wouldn't matter even if htdig
> did parse meta tags like those above - it would never see them if the file
> suffix is wrong.
>
> --
> Gilles R. Detillieux E-mail: <grdetil@scrc.umanitoba.ca>
> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
>
> ------------------------------------
> To unsubscribe from the htdig mailing list, send a message to
> htdig-unsubscribe@htdig.org
> You will receive a message to confirm this.




This archive was generated by hypermail 2b28 : Wed May 24 2000 - 13:37:31 PDT