Re: [htdig] excluding MIME-TYPE(s)


Subject: Re: [htdig] excluding MIME-TYPE(s)
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Fri Feb 25 2000 - 10:02:30 PST


According to Bob Dusek:
> However, I'm having a problem. Just recently, htdig has been crapping
> out on us with a "segmentation fault." I have been running it with -vvv
> today, though. And, I've run it about 7 times. Each time it has
> crapped out. I know what is happening to an extent...
>
> htdig dies.
> I look in the log file.
> It died when it was digging and ran into a ".ppt" file.
> So, I add .ppt to the bad extensions list and start it again.
>
> I run htdig.
> htdig runs for a much longer period of time.
> htdig dies.
> I look in the log file.
> It died when it was digging and ran into a ".WAV" file.
> So, I add .WAV to the bad extensions list and start it again.
>
> etc.
> etc.
>
> I'm using Apache as my web server, and I've got my DefaultType set up as
> "application/octet-stream".
>
> I was just wondering if there was a way to list bad MIME-TYPE(s) instead
> of bad extensions? (I couldn't find any reference to this in the Docs or
> archives in the few hours I've been looking... I could've been looking
> in the wrong spot, though) This way, I could simply list
> "application/octet-stream" as a bad type and then carry on with the
> dig. I'm going crazy. Each time I add a new "bad extension", I think
> I've got it licked and then I get another one!
>
> I am running version 3.1.3 of htdig.

Could you provide us with stack backtraces at the time of the
segmentation faults? If you have a core file and gdb, run the command
"gdb /path/to/htdig /path/to/core", then enter "bt" at the gdb prompt.
If you don't have a core file, you could run htdig from gdb (using its
"run" command) until it segfaults, then do the "bt" command. Please also
send the last dozen or so lines of -vvv output before the segfault, which
may be helpful. I'd like to know what "Content-Type" header htdig is seeing.
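
As an independent check on what the server is sending, something along
these lines could be used. It is only a rough sketch, assuming Python 3 is
available; the function names are made up for illustration, and it reports
what the server returns for a HEAD request, which may not be exactly what
htdig sees on its own request.

```python
import sys
import urllib.request


def parse_content_type(header_value):
    """Return the bare MIME type from a Content-Type header value.

    Parameters such as "; charset=..." are dropped and the type is
    lowercased. Returns None if the header is missing or empty.
    """
    if not header_value:
        return None
    return header_value.split(";", 1)[0].strip().lower()


def fetch_content_type(url):
    """Issue a HEAD request and return the server's Content-Type, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return parse_content_type(response.headers.get("Content-Type"))


if __name__ == "__main__":
    # Pass the URLs of the offending files (e.g. the .ppt and .WAV ones)
    # on the command line.
    for url in sys.argv[1:]:
        print(url, "->", fetch_content_type(url))
```

If this prints text/plain for the .ppt or .WAV URLs, the server is not
applying the DefaultType you expect to those files.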

If it's segfaulting, then there's a bug. You may also want to try 3.1.4,
or the very soon to be released 3.1.5, to see if the bug is still there.
What doesn't make sense to me is why htdig is doing anything at all with
these files. It only attempts to index files with MIME types text/html,
text/plain, text/* (treated as text/plain), and application/pdf, as well
as any types specified in your external_parsers attribute. If your web
server is indeed tagging .ppt and .wav files as application/octet-stream,
or even audio/x-wav (the type for .wav files in some mime.types files),
then htdig shouldn't even be fetching or looking at them. The behaviour
you describe seems to indicate that your server is actually defaulting
to text/plain, as is commonly the case. Even so, that may put junk in
your database, but it shouldn't segfault!
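
To make the rule above concrete, here is a small sketch of the decision
as I've described it. This is not htdig's actual code: the function name
and the external_parser_types argument are hypothetical, standing in for
whatever content types are listed in your external_parsers attribute.

```python
def would_parse(content_type, external_parser_types=()):
    """Return True if a document of this Content-Type would be indexed,
    following the rule described in this message (illustrative only)."""
    # Drop any parameters ("; charset=...") and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()

    # Types handled internally.
    if mime in ("text/html", "text/plain", "application/pdf"):
        return True

    # Any other text/* type is treated as text/plain.
    if mime.startswith("text/"):
        return True

    # Otherwise, only types with an external parser configured
    # (hypothetical list passed in by the caller).
    return mime in {t.lower() for t in external_parser_types}


if __name__ == "__main__":
    for ct in ("application/octet-stream", "audio/x-wav", "text/plain"):
        print(ct, "->", would_parse(ct))
```

By this rule, application/octet-stream and audio/x-wav are both skipped,
while a .ppt or .WAV file served as text/plain would be read as text.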

> BTW - I've thought about using the "valid extensions" config option, but
> I just read (from the archives) about the problems some folks have had
> with the "?" query strings and the "/directory" links, and I didn't
> understand how/where to apply the patch that was given.

This fix will be included in 3.1.5, which is due out very soon now.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



