Re: [htdig] .pdf and .doc-files


Subject: Re: [htdig] .pdf and .doc-files
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Thu Jun 08 2000 - 07:45:02 PDT


According to David Adams:
> On Thu, 8 Jun 2000 09:12:32 -0500 (CDT) Gilles Detillieux
> <grdetil@scrc.umanitoba.ca> wrote:
> > According to Andre Reuber:
> > > I am a beginner with htdig. Is there any possibility
> > > to make an index of .doc, .pdf, .xls, ... files? Do I need any extra
> > > source? Where can I get this source?
> >
> > See http://www.htdig.org/FAQ.html#q4.8
> > and http://www.htdig.org/FAQ.html#q4.9
> >
> > The .xls files may be a bit more of a challenge. I'd recommend using
> > doc2html for .doc & .pdf, and if you find and install the Excel to HTML
> > converter, xlHtml, you could probably add it to doc2html as an extra
> > converter fairly easily (if you have at least a minor understanding
> > of Perl).
> >
>
> I don't think it is quite so simple: doc2html.pl (and
> parse_doc and conv_doc) only use the "magic number" of the
> file to decide which utility to use for conversion.
>
> MS Word and Excel files can have the same magic number.

Oh, yuck!
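
To illustrate the problem David describes, here is a minimal Python sketch (not code from doc2html.pl, which is Perl). It assumes the shared magic number is the 8-byte OLE2 compound-document signature used by binary Word and Excel files; the function name sniff() and the sample data are hypothetical.

```python
# Assumption: the shared "magic number" is the OLE2 compound-document
# signature that begins both binary .doc and binary .xls files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff(header: bytes) -> str:
    """Classify a file by its leading bytes only (hypothetical helper)."""
    if header.startswith(b"%PDF-"):
        return "pdf"
    if header.startswith(OLE2_MAGIC):
        # Word and Excel are indistinguishable at this point.
        return "ole2"
    return "unknown"


# Made-up stand-ins for the first bytes of a Word and an Excel file.
fake_doc = OLE2_MAGIC + b"...word stream..."
fake_xls = OLE2_MAGIC + b"...excel stream..."
print(sniff(fake_doc), sniff(fake_xls))
```

Both samples come back as "ole2", so a script that looks only at the signature cannot choose between a Word converter and an Excel converter.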

> The easy solution is a separate conversion script for excel
> files. The sophisticated solution is a more advanced
> script which uses the information on MIME type passed to it.

It shouldn't be too hard to patch doc2html to look at argument no. 2,
the MIME type.
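
As a rough sketch of what such a patch might do, written in Python rather than doc2html's Perl: choose the converter from the MIME type in the second argument, and fall back to the magic number only when the type is not recognized. The table contents and the name pick_converter() are assumptions for illustration; the command names are placeholders for whatever converters are actually installed.

```python
import sys

# Hypothetical mapping from MIME type to converter command.
# The command names are assumptions, not taken from doc2html.pl.
CONVERTERS = {
    "application/msword": "catdoc",
    "application/pdf": "pdftotext",
    "application/vnd.ms-excel": "xlhtml",
}


def pick_converter(mime_type, header=b""):
    """Return a converter command for the document, or None if unknown."""
    # Drop any parameters such as "; charset=..." and normalise case.
    mime = mime_type.split(";")[0].strip().lower()
    if mime in CONVERTERS:
        return CONVERTERS[mime]
    # Fall back to the magic number only where it is unambiguous.
    if header.startswith(b"%PDF-"):
        return CONVERTERS["application/pdf"]
    return None


if __name__ == "__main__" and len(sys.argv) >= 3:
    # Argument no. 1 is taken to be the file, argument no. 2 the MIME type.
    with open(sys.argv[1], "rb") as handle:
        print(pick_converter(sys.argv[2], handle.read(8)))
```

With the MIME type consulted first, Word and Excel files are told apart even though their leading bytes match.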

> I hadn't heard of xlHTML and would like to know more.
> As an alternative, there is a simple .xls to .csv
> conversion program available from the same site as catdoc.

I've heard nothing about it other than knowing of its existence. Back when
the redhat-announce mailing list was still active, someone posted a message
announcing the availability of this program in RPM form. I made a mental
note to look into it someday, but haven't done so yet.

Would a .csv be plain, indexable ASCII text? If so, that should do the
trick too. In either case, you still have the problem of differentiating
between Word and Excel documents in your script.
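
CSV is ordinarily plain text with comma-separated, optionally quoted fields, so it could be indexed as is or flattened first. A minimal Python sketch of the flattening, assuming the converter's output is ordinary CSV (the exact output of the .xls to .csv program has not been checked here); csv_to_text() is a hypothetical helper:

```python
import csv
import io


def csv_to_text(csv_data: str) -> str:
    """Flatten CSV into space-separated words, one line per row."""
    rows = csv.reader(io.StringIO(csv_data))
    lines = []
    for row in rows:
        # Keep only non-empty cells, with surrounding whitespace trimmed.
        cells = [cell.strip() for cell in row if cell.strip()]
        if cells:
            lines.append(" ".join(cells))
    return "\n".join(lines)


# Made-up sample data for illustration.
sample = 'Item,Qty\n"Widget, large",3\n'
print(csv_to_text(sample))
```

Stripping the quoting and delimiters leaves only the cell contents for the indexer to see.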

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



