[htdig3-dev] Parsing MS-Word Files


J. op den Brouw (MSQL_User@st.hhs.nl)
Mon, 08 Feb 1999 14:32:54 +0100


Hi all,

While we were talking about parsing Word files with catdoc,
maybe we should look at the status of MSWordView. It reads
Word 97 files and prints out HTML. That HTML we can then index
with the HTML parser built into htdig.

This is the same scheme that PDF uses. Catdoc prints out plain
text with no markup, so all the words get an equal score(?).
With HTML, different elements carry different weighting factors,
so it should help the scoring.
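
To make the scoring point concrete, here is a rough sketch in Python
(not htdig code). It scores the same words once as plain text and once
as HTML. The weights in TAG_WEIGHTS and the names score_plain and
score_html are made up for illustration; they are not htdig's actual
factors or functions.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights, chosen only for illustration (not htdig's real factors).
TAG_WEIGHTS = {"title": 10, "h1": 5, "h2": 4, "h3": 3}
TEXT_WEIGHT = 1

WORD_RE = re.compile(r"[a-z0-9]+")


class WeightedWordParser(HTMLParser):
    """Accumulate a per-word score, weighting words by their enclosing tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []      # weighted tags currently open, outermost first
        self.scores = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the innermost matching tag and anything nested inside it.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx:]

    def handle_data(self, data):
        # Use the highest weight among the open tags, or the body-text weight.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=TEXT_WEIGHT)
        for word in WORD_RE.findall(data.lower()):
            self.scores[word] += weight


def score_html(html):
    """Score words in an HTML document, as an HTML-aware indexer might."""
    parser = WeightedWordParser()
    parser.feed(html)
    parser.close()
    return parser.scores


def score_plain(text):
    """Score words in plain text: every occurrence counts the same."""
    return Counter(WORD_RE.findall(text.lower()))


if __name__ == "__main__":
    html = ("<html><head><title>Budget report</title></head>"
            "<body><h1>Budget</h1><p>The report covers the budget.</p></body></html>")
    plain = "Budget report Budget The report covers the budget."
    print("plain:", score_plain(plain).most_common())
    print("html: ", score_html(html).most_common())
```

With plain text, "budget" only stands out by how often it occurs. With
the HTML version, the occurrences in the title and heading push it well
ahead of the body words, which is the effect I would hope to get from
MSWordView output.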

Can someone shed some light on this?

--jesse


