Re: [htdig] .pdf and .doc-files


Subject: Re: [htdig] .pdf and .doc-files
From: David Adams (D.J.Adams@soton.ac.uk)
Date: Thu Jun 08 2000 - 07:31:57 PDT


On Thu, 8 Jun 2000 09:12:32 -0500 (CDT) Gilles Detillieux
<grdetil@scrc.umanitoba.ca> wrote:

> According to Andre Reuber:
> > I am a beginner with htdig. Is there any way to build an index
> > of .doc, .pdf, .xls, ... files? Do I need any extra
> > source? Where can I get this source?
>
> See http://www.htdig.org/FAQ.html#q4.8
> and http://www.htdig.org/FAQ.html#q4.9
>
> The .xls files may be a bit more of a challenge. I'd recommend using
> doc2html for .doc & .pdf, and if you find and install the Excel to HTML
> converter, xlHtml, you could probably add it to doc2html as an extra
> converter fairly easily (if you have at least a minor understanding
> of Perl).
>

I don't think it is quite so simple: doc2html.pl (and
parse_doc and conv_doc) only use the "magic number" of the
file to decide which utility to use for conversion.

MS Word and Excel files can have the same magic number, so
these scripts cannot tell them apart.

The easy solution is a separate conversion script for Excel
files. The sophisticated solution is a more advanced
script that uses the MIME type information passed to it.
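
A minimal sketch of that second approach, written in Python rather than
Perl and not taken from doc2html.pl: it looks at the MIME type first and
falls back to the magic number only when the MIME type is unhelpful. The
function name and the converter table are hypothetical; "xls2csv" stands
in for the .xls to .csv program, and the converter names are placeholders
rather than a tested configuration.

```python
# Hypothetical sketch: choose a converter from the MIME type, using the
# file's magic number only as a fallback.

# Signature of OLE2 compound files, which both Word and Excel can use.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
PDF_MAGIC = b"%PDF"

# Placeholder converter names (assumptions, not a tested setup).
CONVERTER_BY_MIME = {
    "application/msword": "catdoc",
    "application/vnd.ms-excel": "xls2csv",
    "application/pdf": "pdftotext",
}


def choose_converter(mime_type, head):
    """Return a converter name, or None if the file cannot be classified.

    mime_type: the content type passed to the script.
    head: the first few bytes of the file.
    """
    # Drop any parameters such as "; charset=..." and normalise case.
    mime = mime_type.split(";")[0].strip().lower()
    if mime in CONVERTER_BY_MIME:
        return CONVERTER_BY_MIME[mime]
    if head.startswith(PDF_MAGIC):
        return "pdftotext"
    # An OLE2 signature alone cannot tell Word from Excel, so without a
    # useful MIME type the file is left unclassified.
    return None
```

With this ordering a file served as application/vnd.ms-excel goes to the
spreadsheet converter even though its first bytes match a Word document.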

I hadn't heard of xlHtml and would like to know more.
As an alternative, there is a simple .xls to .csv
conversion program available from the same site as catdoc.

> --
> Gilles R. Detillieux E-mail: <grdetil@scrc.umanitoba.ca>
> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930

----------------------
David Adams
D.J.Adams@soton.ac.uk



