Re: [htdig] Using htdig to "tidy" HTML


Subject: Re: [htdig] Using htdig to "tidy" HTML
From: Torsten Neuer (tneuer@inwise.de)
Date: Mon Jun 05 2000 - 23:22:56 PDT


Geoff Hutchison wrote:
>
> On Mon, 5 Jun 2000, Rzepa, Henry wrote:
>
> > retrieved by htdig from the start_url directory. This can be done simply
> > using Dave Raggett's program Tidy, which seems pretty reliable (if not
> > always 100%). However, invoking Tidy seems to require it be defined
> > in conjunction with an external parser for the MIME type text/html.
> > This means entirely over-riding the internal text/html htdig parser.
>
> Alas the problem here is that invoking the external converter feature
> would produce an infinite loop. Setting text/html -> text/html would just
> call the converter again. :-( [The feature here is that you might have a
> converter to gunzip files which then produces PDF files to go to another
> converter.]
>
> I guess the ExternalParser code could be changed so that a converter
> producing text/plain or text/html (or any future internal mime-types)
> passes it off to the internal code.
>
> That said, I thought there was some sort of command-line tool to "spider"
> with Tidy already. Maybe that was something dreamed up by one of my
> friends at school. Still, it seems like a shell script around Tidy would
> be better.

I can only think of a two-step process here: have ht://Dig produce a URL
logfile, pipe it through sort | uniq, and feed the result to the tidy
program afterwards. A simple shell script that serves as an extension to
the rundig script should do.
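
Something along these lines might work as a starting point. It is only a
rough sketch: it assumes the URL log is a plain text file with one URL per
line, and that every URL under a given base URL maps directly onto a file
under a local document root. The function names and the three arguments
(log file, base URL, document root) are made up for illustration and are
not part of ht://Dig or rundig.

```shell
#!/bin/sh
# Sketch only: the log format, base URL and document root are assumptions.

# Step 1: collapse the URL log to one line per distinct URL.
# $1 = URL log file (assumed: one URL per line)
unique_urls() {
    sort "$1" | uniq
}

# Map a URL below the base URL to a file below the document root.
# $1 = URL, $2 = base URL (with trailing slash), $3 = document root
url_to_path() {
    printf '%s\n' "$3/${1#"$2"}"
}

# Step 2: run tidy in place on every local file that a logged URL maps to.
# $1 = URL log file, $2 = base URL, $3 = document root
tidy_urls() {
    unique_urls "$1" | while IFS= read -r url; do
        case "$url" in
            "$2"*) ;;          # URL lies below the base URL: handle it
            *) continue ;;     # anything else is skipped
        esac
        file=$(url_to_path "$url" "$2" "$3")
        # -q keeps tidy quiet, -m rewrites the file in place
        [ -f "$file" ] && tidy -q -m "$file"
    done
}
```

Called at the end of rundig, this would be something like
tidy_urls /path/to/url.log http://www.example.com/ /path/to/docroot,
with all three values replaced by whatever fits the local setup. Since
tidy -m overwrites the files, trying it on a copy of the tree first is
probably wise.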

cheers,

  Torsten

-- 
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14                            Tel: +49-4101-403605
D-25474 Ellerbek                            Fax: +49-4101-403606
E-Mail: info@inwise.de            Internet: http://www.inwise.de



