Subject: Re: [htdig] Indexing Images using external parsers
From: Gilles Detillieux (firstname.lastname@example.org)
Date: Mon Mar 06 2000 - 09:29:39 PST
According to Geoff Hutchison:
> At 5:04 PM +0000 3/5/00, Rzepa, Henry wrote:
> >We noted with interest that htdig V 3.2 does not appear to have any config
> >flags for invoking an external image parser (are we correct?)
> Yes, this is correct. Actually, the htdig/Images.cc code is
> languishing a bit. It has not been cleaned up to use the new
> Transport code, so it still doesn't support HTTP/1.1 or any of those
> new features.
> I don't think it would be too hard to make this code call
> ExternalParser on an image if that's the route you're taking.
> >We have in fact modified the htdig source code to do this, invoking an
> >external parser for the purpose. Not sure yet how it might scale to sites
> >containing a very large number of images. It is of course also possible
> >to pass the content extracted from a GIF to other parsers for "added"
> As Doug said, show us the patches!
Wouldn't it be a simple matter of Retriever::got_image passing the new
URL it builds on to Retriever::got_href, and then defining an external
parser for various image types? Yes, for scalability, this feature
should probably be optional (yet another attribute, yay!!! :), to avoid
all the aborted GETs for file types we don't parse. Of course, if you
also want to index background images, you'd probably want to patch HTML.cc
to call got_image for background=... parameters on <body> and <td> tags.
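
For what it's worth, the external parser side of this could be quite small. Below is a minimal sketch in Python, not tested against htdig itself. It assumes the external parser interface as I recall it from the documentation for the external_parsers attribute: htdig calls the program with four arguments (input file, content type, URL, config file) and reads tab-separated records from stdout, with "t" for a title and "w" for a word followed by a location (0-1000) and a heading level. The function name records_for_image and the idea of indexing words taken from the image's file name are purely illustrative assumptions; a real parser would presumably extract something from the image data.

```python
import os
import re
import sys
from urllib.parse import urlparse, unquote


def records_for_image(url):
    """Build external-parser style records for one image URL.

    Hypothetical heuristic: index the words found in the file name,
    since this sketch does not look inside the image data at all.
    """
    name = os.path.basename(unquote(urlparse(url).path))
    stem, _ext = os.path.splitext(name)
    # Split the file name stem on anything that is not a letter or digit.
    words = [w.lower() for w in re.split(r"[^A-Za-z0-9]+", stem) if w]

    # "t" record: use the file name as the document title.
    records = ["t\t" + name]
    for i, word in enumerate(words):
        # "w" record: word, location in 0..1000, heading level 0 (plain text).
        location = (1000 * i) // max(len(words), 1)
        records.append("w\t%s\t%d\t0" % (word, location))
    return records


def main(argv):
    # Assumed argument order: input file, content type, URL, config file.
    _infile, _content_type, url, _config = argv[1:5]
    for record in records_for_image(url):
        print(record)
    return 0


# Only run as a parser when called with the four expected arguments.
if __name__ == "__main__" and len(sys.argv) == 5:
    sys.exit(main(sys.argv))
```

It would then be hooked up with a config line along the lines of "external_parsers: image/gif /usr/local/bin/image_words.py", where the script name and path are made up for this example.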
-- 
Gilles R. Detillieux              E-mail: <email@example.com>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
This archive was generated by hypermail 2b28 : Mon Mar 06 2000 - 09:34:56 PST