htdig: parse_word_doc.pl revised


J. op den Brouw (MSQL_User@st.hhs.nl)
Thu, 10 Dec 1998 17:22:31 +0100


Hi all,

If you want to index Word files, or have been doing so for some time,
there is a new parse_word_doc.pl at:

http://www.st.hhs.nl/htdig/parse_word_doc.pl.txt

features: code speedup (mucho!)
          the matching patterns didn't work. now they strip .,';: etc.
               at the beginning or end of a word, but not in between.
               so "endings." is changed to "endings" but 1,234,777.99
               stays that way... this is nice when you have URLs
               in your document.
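
To illustrate the edge-only punctuation stripping described above, here is a
minimal sketch in Python. It is not the code from parse_word_doc.pl (which is
Perl); the names EDGE_PUNCT, trim_word and clean_words are hypothetical, and
the exact punctuation set is an assumption, since the post only lists
".,';: etc".

```python
# Characters trimmed from word edges. The post only says ".,';: etc",
# so everything beyond those five characters is an assumption.
EDGE_PUNCT = ".,';:!?\"()"


def trim_word(word):
    """Strip punctuation from the start and end of a word only.

    Punctuation inside the word is left untouched, so numbers such as
    1,234,777.99 and URLs keep their internal dots and commas.
    """
    return word.strip(EDGE_PUNCT)


def clean_words(text):
    """Split text on whitespace, trim each word, drop empty results."""
    trimmed = (trim_word(token) for token in text.split())
    return [word for word in trimmed if word]


if __name__ == "__main__":
    print(clean_words("so endings. is changed but 1,234,777.99 stays"))
```

Trimming only at the edges, rather than deleting punctuation everywhere, is
what keeps numbers and URLs intact while still normalizing words that end a
sentence.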

You need catdoc to run this scheme. See the code.
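
As a rough idea of how a parser can lean on catdoc, the sketch below runs
catdoc on a Word file and splits its plain-text output into words. This is an
assumption about the general approach, not a copy of what parse_word_doc.pl
does; the function names catdoc_command and extract_words are hypothetical,
and the only catdoc behaviour relied on is that "catdoc FILE" writes the
document text to standard output.

```python
import subprocess


def catdoc_command(path, program="catdoc"):
    """Build the argument list for converting one Word file to text."""
    return [program, path]


def extract_words(path, program="catdoc"):
    """Run catdoc on the file and return its output split on whitespace.

    Raises subprocess.CalledProcessError if catdoc exits with an error,
    and FileNotFoundError if catdoc is not installed.
    """
    result = subprocess.run(
        catdoc_command(path, program),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text
        errors="replace",         # tolerate bytes the locale cannot decode
        check=True,
    )
    return result.stdout.split()
```

Calling extract_words("report.doc") (a made-up file name) would return the
words of the document, provided catdoc is on the PATH.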

--jesse



This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:29:50 PST