[htdig] Using htdig to "tidy" HTML

From: Rzepa, Henry (h.rzepa@ic.ac.uk)
Date: Mon Jun 05 2000 - 06:22:32 PDT

We, along with the rest of the world, need to think about how to migrate
document collections to XHTML.

As an adjunct to our work with external parsers in htdig, which we use to
extract meta information from external file types (e.g. gif, vrml, svg, xml
and a whole host of chemical types), we thought it would be useful to add
the option of creating on-the-fly XHTML versions of each document
retrieved by htdig from the start_url directory. This can be done simply
using Dave Raggett's program Tidy, which seems pretty reliable (if not
100% of the time). However, invoking Tidy seems to require that it be
defined as an external parser for the MIME type text/html, which means
entirely overriding htdig's internal text/html parser.
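
To make the idea concrete, a minimal sketch of such a wrapper, in
Python, might look like the following. It is an illustration only, not
our actual setup. It assumes that htdig calls an external parser with
four arguments (temporary file name, content type, URL, configuration
file), that the Tidy binary is on the PATH as "tidy" and accepts the
-quiet, -asxhtml and -output flags found in current HTML Tidy releases,
and that XHTML_ROOT is a hypothetical directory chosen for the
converted copies. The names xhtml_path and tidy_command are invented
for this example.

```python
import subprocess
import sys
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical directory for the generated XHTML copies (an assumption).
XHTML_ROOT = Path("/tmp/xhtml-copies")


def xhtml_path(url, root=XHTML_ROOT):
    """Map a document URL to a .xhtml file under root/<host>/<path>."""
    parts = urlparse(url)
    rel = parts.path.lstrip("/") or "index.html"
    if rel.endswith("/"):
        # Directory-style URLs get a default file name.
        rel += "index.html"
    return (root / parts.netloc / rel).with_suffix(".xhtml")


def tidy_command(src, dest):
    """Build the Tidy command line that writes an XHTML copy of src to dest."""
    return ["tidy", "-quiet", "-asxhtml", "-output", str(dest), str(src)]


def main(argv):
    # Assumed argument order: temporary file, content type, URL, config file.
    src, _content_type, url = argv[1], argv[2], argv[3]
    dest = xhtml_path(url)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Tidy exits non-zero on warnings as well as errors, so its exit
    # status is not treated as fatal here.
    subprocess.run(tidy_command(src, dest))
    # Echo the original HTML unchanged on standard output.
    sys.stdout.write(Path(src).read_text(errors="replace"))


if __name__ == "__main__" and len(sys.argv) >= 4:
    main(sys.argv)
```

Note that the script only echoes the original HTML; it does not emit
htdig's parser record format, so on its own it does not replace the
indexing done by the internal parser.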

Does anyone have any idea how to invoke both? That is, the internal parser
to index the content of the HTML file, and also an external parser to
convert it on the fly to XHTML? (And before someone asks: no, we
do not intend this to be done for large document collections,
since I suspect the process will be a slow one.)


Henry Rzepa.
+44 (0)20 7594 5774 (Office)  +44 (0)20 7594 5804 (Fax)
Dept. Chemistry, Imperial College, London, SW7 2AY, UK.
http://www.ch.ic.ac.uk/rzepa/

