[htdig] Word parsers


Subject: [htdig] Word parsers
From: D.J.Adams@soton.ac.uk
Date: Wed Jan 26 2000 - 06:08:11 PST


I've done a quick investigation of two programs which parse Word
documents, and I thought it might interest others on the htdig list.

The two are Catdoc from http://www.fe.msk.ru/~vitus/catdoc/,
and Wp2html from http://www.res.bbsrc.ac.uk/wp2html/.

Catdoc is freeware. Wp2html is available for a small sum from a one-man
business, and the source code is made available. (It cost us, as a
University, a mere 25 pounds for the right to run it on one Unix server
and receive upgrades.)

I saved a Word97 document in Word2 and Word6 formats and then tried
to see if the programs could extract text from the files:

Version    Catdoc    Catdoc    Wp2html
of Word    0.90a     0.91.2    version 3.2

2.0        Yes(1)    Yes       No
6.0        Yes(2)    Yes       No
97         Yes(2)    Yes       Yes

Notes (1) - Very large number of spurious characters output with text
      (2) - A few spurious characters at the end of output.
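
For anyone who wants to repeat this kind of comparison, the "spurious characters" in the notes above could be put on a rough numerical footing. The following is only a sketch under my own assumptions, not the method used for the table: the function names spurious_ratio and extract_text are made up for illustration, the definition of "spurious" (control, unassigned, private-use, surrogate and replacement characters) is a guess at what matters, and it assumes the converter takes a file name as its last argument and writes text to standard output.

```python
import subprocess
import unicodedata


def spurious_ratio(text):
    """Return the fraction of characters in text that look like debris.

    Assumption: "spurious" means control, unassigned, private-use or
    surrogate characters, plus the Unicode replacement character.
    Newlines, carriage returns and tabs are treated as legitimate.
    """
    if not text:
        return 0.0
    bad = 0
    for ch in text:
        if ch in "\n\r\t":
            continue
        if ch == "\ufffd" or unicodedata.category(ch) in ("Cc", "Cn", "Co", "Cs"):
            bad += 1
    return bad / len(text)


def extract_text(command, path):
    """Run a converter (given as a list of arguments) on path.

    Assumption: the converter prints the extracted text on stdout.
    Bytes that are not valid UTF-8 become replacement characters,
    so they are counted by spurious_ratio().
    """
    result = subprocess.run(command + [path], capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

With a converter installed, something like spurious_ratio(extract_text(["catdoc"], "sample.doc")) would give one figure per program and file format, where sample.doc is a hypothetical test file. A ratio near zero corresponds to a plain "Yes" in the table; what threshold separates note (1) from note (2) is a judgement call.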

For conversion of Word documents to plain text Catdoc-0.91.2 is a clear
winner and comes bundled with a utility for creating CSV files from
Excel spreadsheets. If you are using an earlier version of Catdoc then
there are good grounds for upgrading.

Wp2html is sold as a utility for converting WordPerfect documents to
HTML, and works with everything from version 5.1 to 8.0 that I have
tried. It is very configurable and I was able to get it to output plain
text without too much trouble. If you want to convert Word97 files into
HTML then it is the clear choice. It continues in development and we
may hope that later versions will cope with other Word formats.

Is anybody able to try these products with Word2000?

                

-- 
 
David J Adams
<D.J.Adams@soton.ac.uk>
Computing Services
University of Southampton



