Re: [htdig3-dev] Parsing MS-Word Files


Geoff Hutchison (ghutchis@wso.williams.edu)
Mon, 8 Feb 1999 08:53:32 -0400


>While we were talking about parsing Word files with catdoc,
>maybe we should look at the status of MSWordView. It reads
>Word 97 files and prints out HTML. Now HTML we can index
>with the HTML parser built into htdig.

Several people have pointed out the utility of having "pass-through"
ExternalParsers. So a class called something like "ExternalFilter" might be
a good idea. The filter would take the file, perform some action (say
gunzip or MSWordView) and pass it back for further parsing. The class would
look somewhat like the ExternalParser class, but a bit simpler since it
obviously doesn't actually do any parsing. :-)
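
A rough sketch of the pass-through idea, in Python for brevity rather than
ht://Dig's C++. The class name follows the suggestion above, but the
constructor arguments and the apply() method are assumptions made for
illustration, not existing ht://Dig code. The filter transforms the bytes and
returns them with a MIME type if it knows one, so the result can be handed to
another parser.

```python
import gzip
from typing import Callable, Optional, Tuple


class ExternalFilter:
    """Pass-through step: transforms the document but does no parsing itself."""

    def __init__(self, transform: Callable[[bytes], bytes],
                 output_type: Optional[str] = None) -> None:
        self.transform = transform
        # None means the filter cannot know what it produced (e.g. gunzip).
        self.output_type = output_type

    def apply(self, data: bytes) -> Tuple[bytes, Optional[str]]:
        """Return the transformed bytes and their MIME type, if known."""
        return self.transform(data), self.output_type


# A general uncompress filter: the output type is left unknown.
gunzip_filter = ExternalFilter(gzip.decompress)

compressed = gzip.compress(b"<html><body>hello</body></html>")
body, mime = gunzip_filter.apply(compressed)
```

A converter such as MSWordView could declare output_type="text/html" up
front, whereas the gunzip filter has to leave it as None.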

The only snag in this plan is figuring out the MIME type after filtering.
In particular, an uncompress filter would be fairly general and would have
a hard time knowing what it produced. However, if we add better MIME code
to the Retriever, this can be done internally.
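
One way the type could be worked out internally is by looking at the first
bytes of the filtered output. The sketch below is only an assumption about
how that might look, again in Python: sniff_mime_type and the signature table
are hypothetical names, and the table is nowhere near complete. Note that the
OLE2 signature is shared by other Office formats, so mapping it to Word is a
guess.

```python
from typing import Optional

# Illustrative signatures only; a real table would be much longer.
_SIGNATURES = [
    (b"\x1f\x8b", "application/x-gzip"),
    (b"%PDF-", "application/pdf"),
    # OLE2 container; assumed here to be a Word document.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "application/msword"),
]


def sniff_mime_type(data: bytes) -> Optional[str]:
    """Guess a MIME type from leading bytes; None if nothing matches."""
    for magic, mime in _SIGNATURES:
        if data.startswith(magic):
            return mime
    # HTML has no fixed magic number, so look for a leading tag instead.
    head = data[:512].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "text/html"
    return None
```

If the guess comes back as a compressed or otherwise filterable type, the
output can go through another filter; if it comes back as None, the document
can be skipped or treated as plain text.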

Cheers,
-Geoff




This archive was generated by hypermail 2.0b3 on Mon Feb 08 1999 - 06:17:16 PST