Re: [htdig] re: parsing stuff


Subject: Re: [htdig] re: parsing stuff
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Thu May 11 2000 - 13:33:40 PDT


According to gil cohen:
> Okay, here's a program I wrote:
>
> -------------
> tr -d '\12' < "$1" | sed -e 's/.*<title>//' -e 's/<\/title>.*//' >> /test
> echo "Content-Type: text/html"
> echo ''
> cat "$1"
> -------------
>
> Then, I put the following in the config file:
> text/html->text/html "sh /RIDOF.sh"

That's not quite complete. You need to have "external_parsers: " in
front of that. However, that still won't work - in fact, it will cause
htdig's ExternalParser module to recursively call itself until it blows
its stack, because the converter's text/html output gets handed right
back to the same text/html converter. Right now, external converters
must convert one mime type to a different type, and the chain must
eventually lead to an actual parser (whether internal or external).
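
To make the difference concrete, here is a rough sketch of the two cases
as config file entries. They are alternatives, not meant to appear
together. The second entry and its /doc2html.sh script are hypothetical,
shown only to illustrate a converter whose output type differs from its
input type:

```
# Recurses: the output type is the same as the input type.
external_parsers: text/html->text/html "sh /RIDOF.sh"

# Terminates: the output is text/html, which the internal HTML parser
# handles. (/doc2html.sh is a hypothetical converter script.)
external_parsers: application/msword->text/html "sh /doc2html.sh"
```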

You have two choices: you can modify the existing internal HTML parser,
or you can write a full external parser for HTML that grabs all the
information you want from the documents, including links to other
documents.
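
As a starting point for the second choice, here is a minimal sketch of
an external parser in sh. It rests on assumptions that should be checked
against the external_parsers documentation for your htdig version: that
htdig calls the parser with the document file as its first argument, and
that it reads tab-separated records from the parser's standard output,
with "t" for the title and "u" for a link plus its description. The
emit_records function name is made up for this example, and the tag
matching is case-sensitive and very naive.

```shell
#!/bin/sh
# Hypothetical sketch of an external HTML parser for htdig.
# Assumption: htdig runs it as  parser <file> <content-type> <url> <config>
# and reads tab-separated records from stdout (t = title, u = link).

# emit_records FILE: print a title record and one link record per href.
emit_records() {
    # Join all lines so a title split over several lines still matches.
    flat=$(tr '\n' ' ' < "$1")

    # Title record: t<TAB>title (printed only if a <title> element exists).
    title=$(printf '%s\n' "$flat" |
        sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p')
    [ -n "$title" ] && printf 't\t%s\n' "$title"

    # Link records: u<TAB>url<TAB>description.
    # The URL doubles as the description to keep the sketch short.
    printf '%s\n' "$flat" | grep -o 'href="[^"]*"' |
        sed -e 's/^href="//' -e 's/"$//' |
        while read -r url; do
            printf 'u\t%s\t%s\n' "$url" "$url"
        done
    return 0
}

# Run only when a file argument is given.
if [ -n "$1" ]; then
    emit_records "$1"
fi
```

This only covers titles and links. To make the documents searchable, a
real parser would also have to emit a record for every word in the text,
which is the bulk of the work the internal HTML parser does now.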

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
