Re: [htdig] Indexing Images using external parsers


Subject: Re: [htdig] Indexing Images using external parsers
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Mon Mar 06 2000 - 09:29:39 PST


According to Geoff Hutchison:
> At 5:04 PM +0000 3/5/00, Rzepa, Henry wrote:
> >We noted with interest that htdig V 3.2 does not appear to have any config
> >flags for invoking an external image parser (are we correct?)
>
> Yes, this is correct. Actually, the htdig/Images.cc code is
> languishing a bit. It has not been cleaned up to use the new
> Transport code, so it still doesn't support HTTP/1.1 or any of those
> new features.
>
> I don't think it would be too hard to make this code call
> ExternalParser on an image if that's the route you're taking.
>
> >We have in fact modified the htdig source code to do this, invoking an
> >external parser for the purpose. Not sure yet how it might scale to sites
> >containing a very large number of images. It is of course also possible
> >to pass the content extracted from a GIF to other parsers for "added"
>
> As Doug said, show us the patches!

Wouldn't it be a simple matter of Retriever::got_image passing the new
URL it builds on to Retriever::got_href, and then defining an external
parser for various image types? Yes, for scalability, this feature
should probably be optional (yet another attribute, yay!!! :), to avoid
all the aborted GETs for file types we don't parse. Of course, if you
also want to index background images, you'd probably want to patch HTML.cc
to call got_image for background=... attributes on <body> and <td> tags.
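
To make the idea concrete, here is a rough, self-contained sketch of
that control flow. It is not the actual htdig code: SketchRetriever, the
index_images flag and the naive resolve() helper are all made up for
illustration, and the real got_image and got_href signatures may well
differ. Only the two method names come from the discussion above.

```cpp
#include <string>
#include <vector>

// Stand-in for htdig's Retriever class; NOT the real thing.
class SketchRetriever {
public:
    // index_images is a hypothetical switch, standing in for the
    // optional config attribute suggested above.
    explicit SketchRetriever(bool index_images)
        : index_images_(index_images) {}

    // Queue a URL for retrieval, as would happen for an ordinary link.
    void got_href(const std::string &url) { queue_.push_back(url); }

    // Called for each image reference found while parsing a document.
    void got_image(const std::string &base, const std::string &src) {
        std::string url = resolve(base, src);
        images_seen_.push_back(url);      // always record the image URL
        if (index_images_)
            got_href(url);                // only fetch it when enabled
    }

    const std::vector<std::string> &queue() const { return queue_; }
    const std::vector<std::string> &images_seen() const { return images_seen_; }

private:
    // Very naive resolution: handles full URLs and plain relative names
    // only (no "../", no leading "/"). Just enough for the sketch.
    static std::string resolve(const std::string &base,
                               const std::string &src) {
        if (src.find("://") != std::string::npos)
            return src;
        std::string::size_type slash = base.rfind('/');
        return base.substr(0, slash + 1) + src;
    }

    bool index_images_;
    std::vector<std::string> queue_;
    std::vector<std::string> images_seen_;
};
```

With the flag off, images are only recorded and nothing extra is
fetched. With it on, each image URL goes through the same queue as any
other link, so an external parser defined for the relevant image types
would get a chance at it.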

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



