Re: htdig: External parser again

Andrew Scherpbier
Wed, 07 Oct 1998 11:08:08 -0700

Geoff Hutchison wrote:
> >In fact I do not understand why this seems to be so complicated.
> >Htdig will be more customizable if it parses text files only, all other
> >files being handled via external parsers. With something like that in
> >htdig.conf:
> Well it would be more customizable if it handles external parsers well. But
> parsing the file directly to text may not be the best solution.

Definitely not the best solution, but a good default if there is a
<doctypex>2text filter available.

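A rough sketch of that default, in Python for brevity (this is not htdig
code). It assumes a hypothetical naming convention where the filter for a
content-type is called <subtype>2text and is looked up on the PATH; the
function name default_text_filter and the "x-" prefix handling are
illustrative assumptions.

```python
import shutil


def default_text_filter(content_type, which=shutil.which):
    """Return the name of a '<subtype>2text' filter for content_type, or None.

    Hypothetical convention: 'application/pdf' maps to a program 'pdf2text'.
    `which` is injectable so the lookup can be tested without real programs.
    """
    # Strip parameters such as "; charset=..." and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if "/" not in mime:
        return None
    subtype = mime.split("/", 1)[1]
    # Drop an experimental "x-" prefix, e.g. application/x-dvi -> dvi.
    if subtype.startswith("x-"):
        subtype = subtype[2:]
    if not subtype:
        return None
    candidate = subtype + "2text"
    # Only use the filter as a default if it is actually installed.
    return candidate if which(candidate) else None
```

If no such filter is installed the function returns None, and the caller
can fall back to a richer external parser or skip the document.
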
> Many formats include graphics (which we may wish to keep track of), and
> some formats now include hyperlinks and/or URLs. And what about metadata?
> If I was to parse LaTeX documents, I'd want the title counted like the
> title of an HTML document, etc.

This is exactly why I started to implement the external parser stuff. It is
severely broken at present because:
1) it has a static list of content-types it recognizes
2) it has a maximum document size that is very likely to interfere
with external parsers.
3) it insists on reading the whole document into memory, which makes
it a memory hog.
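
For the third point, one way to avoid holding the whole document in memory
is to read the external parser's output a line at a time. A minimal sketch
in Python (not htdig code), assuming the parser is a command that takes the
file name as its last argument and writes text to stdout; the names
stream_parser_output and count_words are illustrative.

```python
import subprocess


def stream_parser_output(command, path):
    """Yield lines from an external parser without buffering all its output."""
    proc = subprocess.Popen(command + [path], stdout=subprocess.PIPE, text=True)
    try:
        # Iterating over the pipe reads one line at a time.
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        proc.stdout.close()
        proc.wait()


def count_words(command, path):
    """Example consumer: count words while keeping only one line in memory."""
    total = 0
    for line in stream_parser_output(command, path):
        total += len(line.split())
    return total
```

Because the consumer sees the text incrementally, a fixed maximum document
size is no longer needed just to bound memory use.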

> I'm not going to address these issues in the htdig3 maintenance. However, I
> think this is a great topic for htdig4 development. Feedback is always
> welcome. :-)

#1 above is easy to fix in htdig3.
#2 and #3 are much easier to deal with if you have threads and good
synchronization. Java makes this trivial.
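
To illustrate the threads-and-synchronization idea, here is a small sketch
using a bounded queue between a reader thread and the indexing code. It is
written in Python rather than Java to keep it short, and it is only an
assumption about how such a design might look; index_stream, handle_chunk
and max_pending are made-up names.

```python
import queue
import threading


def index_stream(chunks, handle_chunk, max_pending=4):
    """Pass chunks from a reader thread to the caller through a bounded queue.

    Returns the number of chunks handled. At most `max_pending` chunks are
    waiting in the queue at any time, which bounds memory use.
    """
    pending = queue.Queue(maxsize=max_pending)
    done = object()  # sentinel marking the end of the document

    def reader():
        try:
            for chunk in chunks:
                pending.put(chunk)  # blocks while the queue is full
        finally:
            pending.put(done)  # always signal the end, even on error

    worker = threading.Thread(target=reader)
    worker.start()
    count = 0
    while True:
        chunk = pending.get()  # blocks until the reader provides a chunk
        if chunk is done:
            break
        handle_chunk(chunk)
        count += 1
    worker.join()
    return count
```

The queue does the synchronization: the reader stalls when the indexer
falls behind, so neither side needs the whole document at once.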

Andrew Scherpbier
Contigo Software
