Re: [htdig] Search results


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Tue, 10 Aug 1999 15:16:24 -0500 (CDT)


According to peter karlsson:
> > 1) which version of ht://Dig is causing you this problem?
>
> 3.1.2
>
> > 2) have you applied any patches to it, or made any changes at all to
> > the source? if so, what?
>
> I recently applied the patches that were posted to this list. The last
> database update (of August 1st) was done using the old version, though.

It's possible that the recent patch for HTTP header parsing would solve this
problem, but given the behaviour you're describing it seems unlikely to have
an impact. Still, it may be worth doing a test run of htdig using just
a few of the problem documents as your start_url. If you run it with -vvv,
I'd be interested in seeing what it's picking up for the Content-Type header.

> > 3) what is your OS version?
>
> Solaris 2.5.1.
>
> > 4) what does your htdig.conf file look like? (you may strip out comments
> > and any attributes you don't want to post to the list, but I'm interested
> > in seeing any attributes that may have an impact on what gets indexed
> > in the documents, and what gets stripped out)
>
> database_dir: /opt/www/htdig/db
> start_url: http://www.mds.mdh.se/ http://info.mds.mdh.se/ http://chess.mds.mdh.se/ http://mud.mds.mdh.se/ http://clay.mds.mdh.se/ http://parallax.mds.mdh.se/ http://proxy.mds.mdh.se/proxy/ http://hitta.mds.mdh.se/
> limit_urls_to: ${start_url}
[snip]
> locale: sv
> iso_8601: yes
> http_proxy: http://127.0.0.1:3128

I didn't see anything suspect above, but I'm curious about the proxy
server you're running on your local host. It seems that htdig is indexing
HTML files as though they were plain text, not HTML. I'm wondering if
it's possible that your proxy server is messing up the Content-Type
headers from your web servers, or your web servers themselves aren't
putting out the right headers.

I tried a search for "star wars" on your site, as you suggested, and it
turned up lots of pages. Most of these were plain text. Of the HTML
ones, some of the URLs ended in .html, most in .htm, and three in just
"/". All the .htm files had the problem, and none of the .html ones did.
Strangely, 2 of the 3 URLs ending in "/" had the problem.

In any case, the first thing I'd check is to make sure your proxy server
and your web servers are set up to return the right MIME type for .htm.
Next, for any documents that have the problem, have a look at what
"htdig -vvv" is picking up for their Content-Type header. If it's
text/plain, or text/* other than text/html, that's the cause of the
problem. You may also want to temporarily disable the http_proxy
attribute, if you can, just to see if it's the proxy server that's
the problem.
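
As a rough way to do that comparison outside of htdig, here is a minimal
Python sketch that asks for the Content-Type of a URL both directly and
through a proxy, and flags anything that isn't text/html. The function
names (media_type, is_html, fetch_content_type, compare) are made up for
this illustration and are not part of ht://Dig. It also sends HEAD
requests, which is an assumption on my part: a server or proxy may answer
a GET differently, so treat the result as a hint and not as proof of what
htdig itself sees.

```python
import urllib.request


def media_type(content_type):
    """Return the bare media type from a Content-Type value, lowercased."""
    if not content_type:
        return ""
    # Drop parameters such as "; charset=iso-8859-1"
    return content_type.split(";", 1)[0].strip().lower()


def is_html(content_type):
    """True only when the header names text/html (not text/plain etc.)."""
    return media_type(content_type) == "text/html"


def fetch_content_type(url, proxy=None):
    """Send a HEAD request and return the raw Content-Type header, or None."""
    # An empty mapping disables any proxy picked up from the environment,
    # so the "direct" request really does bypass the proxy.
    proxies = {"http": proxy} if proxy else {}
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
    request = urllib.request.Request(url, method="HEAD")
    with opener.open(request, timeout=10) as response:
        return response.headers.get("Content-Type")


def compare(url, proxy):
    """Fetch the header with and without the proxy and report both."""
    direct = fetch_content_type(url)
    proxied = fetch_content_type(url, proxy)
    return {
        "direct": direct,
        "proxied": proxied,
        "direct_is_html": is_html(direct),
        "proxied_is_html": is_html(proxied),
    }


# Example call (needs network access; URL and proxy taken from the
# configuration quoted above):
# print(compare("http://www.mds.mdh.se/", "http://127.0.0.1:3128"))
```

If the direct and proxied values differ for a .htm document, the proxy is
the likely culprit; if both come back as something other than text/html,
the web server's MIME type mapping is the place to look.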

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
