Re: [htdig] Indexing binary files by filename


Subject: Re: [htdig] Indexing binary files by filename
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Sat May 20 2000 - 15:39:28 PDT


According to Geoff Hutchison:
> At 4:56 PM +0100 5/19/00, Darrell Berry wrote:
> >"Indexing binary files by filename (simply need to write a minimal parser
> >for this)"
> >
> >It's on the todo list---can I cast my vote for it happening soon? We have a
> >site which is about 50% text documents and 50% QuickTime movies, sound files,
> >etc., and being able to search for these media clips by filename would be a
> >godsend!
>
> Remember those textbooks that say "this is an exercise left to the
> reader?" This is my version. :-)
>
> The biggest catch is that htmerge will currently remove documents
> that don't have an excerpt. So you probably want a minimal script
> that returns something for a title and something for an excerpt. (My
> suggestion would be to return the file type as an excerpt, like
> "QuickTime movie" or "MP3 file" but anything is fine.)
>
> Then you'd probably want to remove these file types from the
> bad_extensions list.

I had a long and frustrating e-mail exchange with someone known only as
"System", back in August of last year, about a very similar problem -
indexing JPEG files by name. Unfortunately, it was conducted off-list,
so it's not archived on htdig.org. (Strangely enough, he kept asking
for opinions from others, even though he was e-mailing me privately.
I couldn't seem to get through to him enough for him to understand that
point, let alone the advice I was offering, hence the frustration.)

In any case, I suggested using an external parser that would spit out
the file name as text, which should solve the problem quite easily,
although it may be a bit on the slow side due to needless downloading
of the file contents, which are ignored. Now, with external converter
support, the job is even easier. Here's an example:

external_parsers: application/mp3->text/html /usr/local/bin/spitname

where spitname is this script:

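#!/bin/sh
# htdig passes: $1 = temporary file holding the document contents,
# $2 = content type, $3 = URL. Quote them in case of odd characters.
echo "<html><head><title>$(basename "$3"): $2</title></head><body>"
file "$1" | sed 's/^[^:]*: *//'
echo "</body></html>"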

This will spit out the file name and MIME type as the title, and the
output of the file command as the body, all of which will be indexed.
Of course, you could customise this to put out whatever text you want.
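
As one possible customisation, here is a rough sketch along the lines of
Geoff's suggestion of returning the file type as the excerpt. It doesn't
run the file command at all and just picks a description from the
extension of the URL. The function names (describe_type, spit_page) and
the extension table are my own assumptions for illustration, not anything
htdig defines, so adjust them to suit your site. Note that htdig still
fetches the file; this script simply never reads it.

```shell
#!/bin/sh
# Hypothetical variant of spitname: ignores the file contents ($1)
# and works only from the content type ($2) and the URL ($3).

# Map a file name to a short description. The table below is an
# assumption for illustration; extend it as needed.
describe_type() {
    case "$1" in
        *.mov|*.qt)        echo "QuickTime movie" ;;
        *.mp3)             echo "MP3 file" ;;
        *.wav|*.au|*.aiff) echo "Sound file" ;;
        *)                 echo "Binary file" ;;
    esac
}

# Emit a minimal HTML page: file name and content type as the title,
# and the description as the body, so there is something to excerpt.
# Usage: spit_page content-type URL
spit_page() {
    name=$(basename "$2")
    echo "<html><head><title>$name: $1</title></head><body>"
    describe_type "$name"
    echo "</body></html>"
}

# htdig calls the parser as: spitname infile content-type URL config-file
if [ "$#" -ge 3 ]; then
    spit_page "$2" "$3"
fi
```

The matching is case-sensitive and doesn't escape HTML special characters
in the file name, so treat it as a starting point rather than a finished
parser.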

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



