Re: [htdig] Indexing Images using external parsers

Subject: Re: [htdig] Indexing Images using external parsers
From: Gilles Detillieux
Date: Mon Mar 06 2000 - 09:29:39 PST

According to Geoff Hutchison:
> At 5:04 PM +0000 3/5/00, Rzepa, Henry wrote:
> >We noted with interest that htdig V 3.2 does not appear to have any config
> >flags for invoking an external image parser (are we correct?)
> Yes, this is correct. Actually, the htdig/ code is
> languishing a bit. It has not been cleaned up to use the new
> Transport code, so it still doesn't support HTTP/1.1 or any of those
> new features.
> I don't think it would be too hard to make this code call
> ExternalParser on an image if that's the route you're taking.
> >We have in fact modified the htdig source code to do this, invoking an
> >external parser for the purpose. Not sure yet how it might scale to sites
> >containing a very large number of images. It is of course also possible
> >to pass the content extracted from a GIF to other parsers for "added"
> As Doug said, show us the patches!

Wouldn't it be a simple matter of Retriever::got_image passing the new
URL it builds on to Retriever::got_href, and then defining an external
parser for various image types? Yes, for scalability, this feature
should probably be optional (yet another attribute, yay!!! :), to avoid
all the aborted GETs for file types we don't parse. Of course, if you
also want to index background images, you'd probably want to patch
to call got_image for background=... parameters on <body> and <td> tags.
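To make the external parser half of that concrete, here is a rough sketch of what such a parser could look like. It is not from any patch posted to the list. It assumes the external parser interface as I recall it from the external_parsers documentation (arguments: input file, content type, URL, config file; output: tab-separated records such as "t" for a title and "w" for a word with a location from 0 to 1000 and a heading level), so check that against the version you run. The function name image_records is hypothetical, and pulling words out of the image's file name is only a stand-in for whatever real extraction you would do on the image data.

```python
#!/usr/bin/env python3
"""Sketch of an external parser for images (hypothetical, not part of ht://Dig)."""
import re
import sys
from urllib.parse import unquote, urlparse


def image_records(url):
    """Return parser output records built from the words in an image URL's file name.

    Assumed record format (verify against your ht://Dig documentation):
      t<TAB>title
      w<TAB>word<TAB>location (0-1000)<TAB>heading level (0 = none)
    """
    # Last path component of the URL, e.g. "benzene_ring.gif".
    name = unquote(urlparse(url).path).rsplit("/", 1)[-1]
    # Drop the extension, then split on anything that is not a letter or digit.
    stem = name.rsplit(".", 1)[0]
    words = [w.lower() for w in re.findall(r"[A-Za-z0-9]+", stem)]

    records = ["t\t" + name]
    for i, word in enumerate(words):
        # Spread the words evenly over the 0-1000 location range.
        location = (i * 1000) // max(len(words), 1)
        records.append("w\t%s\t%d\t0" % (word, location))
    return records


# Assumed argument order: infile, content-type, URL, config-file.
if __name__ == "__main__" and len(sys.argv) >= 4:
    for record in image_records(sys.argv[3]):
        print(record)
```

You would then map the image types to the script with something like "external_parsers: image/gif /usr/local/bin/image_parser.py" in the config file (the script path is made up). A real parser would open the file named in the first argument and emit words from the image's comments or other metadata rather than from the URL.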

Gilles R. Detillieux
Spinal Cord Research Centre
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

