Re: htdig: Re: ht://Dig and MSWord


Pirmin Kalberer (pka@mmls.ch)
Wed, 27 May 1998 08:05:19 +0000


Richard Jones wrote:
> Pirmin Kalberer wrote:
> >
> > Richard Jones wrote:
> > > Unfortunately, working out exactly how external parsers work
> > > was beyond my abilities & I gave up. The solution is definitely
> > > possible, using `catdoc' and a simple shell script. I suggest
> > > you maybe ask Andrew Scherpbier exactly how the external parsing
> > > mechanism works, and then you or I can work out how to connect
> > > up catdoc.
> > >
> >
> > We convert our Winword and Excel files with a Perl script which is
> > much better than catdoc. The three modules OLE-Storage, Unicode::Map
> > and Startup from Martin Schwartz can be found on CPAN. There
> > is a description in the May issue of the German Unix magazine 'iX'.
>
> In this situation, we can't run a script over the *.doc
> files to generate HTML (at least, we could, but it wouldn't
> be very easy at all ...). The Word files are all stored on NT,
> and NT of course can't export the filesystem usefully.

Our files are stored on NT, too! They are mounted on a Linux
machine with Samba (http://samba.anu.edu.au/samba/samba.html).
Converting the files statically preserves the access rights and makes
access faster (quick view of search results). I patched htdig to convert
links like file://nt/dir to http://www/mounted-nt/dir for digging and
back again for displaying the search results.
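
To make the link mapping concrete, here is a minimal sketch in Python.
It is not the actual htdig patch (htdig itself is C++); it only shows
the prefix swap described above. The two prefixes are taken from the
example links; the function names to_dig_url and to_display_url are
hypothetical.

```python
# Prefixes taken from the example links in this message.
FILE_PREFIX = "file://nt/"
HTTP_PREFIX = "http://www/mounted-nt/"


def to_dig_url(url):
    """Map a file:// link to the HTTP URL of the mounted share (for digging)."""
    if url.startswith(FILE_PREFIX):
        return HTTP_PREFIX + url[len(FILE_PREFIX):]
    # Leave every other link untouched.
    return url


def to_display_url(url):
    """Map the HTTP URL back to the file:// link (for the search results)."""
    if url.startswith(HTTP_PREFIX):
        return FILE_PREFIX + url[len(HTTP_PREFIX):]
    return url
```

Only links that start with one of the two prefixes are changed, so
ordinary web links pass through both functions unmodified.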

>
> I really think an external parser would be better, perhaps
> in conjunction with txt2html.
>

IMHO, an external parser for Winword docs is a must.
It should be easy to call the Perl script to convert a
Winword doc to ASCII, and that's enough for indexing.
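
As a rough sketch of what such a wrapper might look like (in Python
here, though a shell or Perl wrapper would do the same job): it runs a
converter command on one file and returns the ASCII text. The function
name convert_to_text and the converter command line are assumptions;
the record format that htdig expects from an external parser is not
covered in this thread and is not shown.

```python
import subprocess


def convert_to_text(doc_path, converter_cmd):
    """Run an external converter on doc_path and return its output as ASCII.

    converter_cmd is a list such as ["perl", "some-converter.pl"]
    (a hypothetical command line); doc_path is appended as its last argument.
    """
    result = subprocess.run(
        list(converter_cmd) + [doc_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the converter fails
    )
    # Drop any bytes that are not plain ASCII; that is enough for indexing.
    return result.stdout.decode("ascii", errors="ignore")
```

Keeping the converter as a parameter means the same wrapper could call
catdoc or the Perl script without changes.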

Pirmin

--
Pirmin Kalberer <pka@mmls.ch>
Mueller Martini Logistik-Systeme AG, CH-8031 Zuerich
Phone: +41 1 279 13 90  Fax: +41 1 279 12 63


