Re: htdig: External parser again


Andrew Scherpbier (andrew@contigo.com)
Wed, 07 Oct 1998 11:08:08 -0700


Geoff Hutchison wrote:
>
> >In fact I do not understand why this seems to be so complicated.
> >Htdig will be more customizable if it parses text files only, all other
> >files being handled via external parsers. With something like that in
> >htdig.conf:
>
> Well it would be more customizable if it handles external parsers well. But
> parsing the file directly to text may not be the best solution.

Definitely not the best solution, but a good default if there is a
<doctypex>2text filter available.

> Many formats include graphics (which we may wish to keep track of), and
> some formats now include hyperlinks and/or URLs. And what about metadata?
> If I was to parse LaTeX documents, I'd want the title counted like the
> title of an HTML document, etc.

This is exactly why I started to implement the external parser stuff. It is
severely broken at present because:
1) Document.cc has a static list of content-types it recognizes
2) Document.cc has a maximum document size that is very likely to interfere
with external parsers.
3) Document.cc insists on reading the whole document into memory. Memory hog
city!
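
For point 1, one possible direction is to replace the static list with a
table that is filled at run time, for example from the configuration file.
The sketch below is only an illustration: the class ParserRegistry, its
methods, and the command name pdf2text are made up for this example and are
not taken from Document.cc or any ht://Dig release.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical registry mapping a MIME content-type to the command line of
// an external parser. Nothing here mirrors the real Document.cc.
class ParserRegistry {
public:
    // Register (or overwrite) the parser command for a content-type.
    void add(const std::string& content_type, const std::string& command) {
        assert(!content_type.empty());
        parsers_[content_type] = command;
    }

    // Look up the parser for a content-type. Returns true and fills
    // `command` when one is registered; returns false and leaves `command`
    // untouched otherwise.
    bool lookup(const std::string& content_type, std::string& command) const {
        std::map<std::string, std::string>::const_iterator it =
            parsers_.find(content_type);
        if (it == parsers_.end()) {
            return false;
        }
        command = it->second;
        return true;
    }

private:
    std::map<std::string, std::string> parsers_;
};
```

A caller would register entries such as "application/pdf" -> "pdf2text"
(a hypothetical filter name) at startup and fall back to the built-in
parsers when lookup() returns false.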

> I'm not going to address these issues in the htdig3 maintenance. However, I
> think this is a great topic for htdig4 development. Feedback is always
> welcome. :-)

#1 above is easy to fix in htdig3.
#2 and #3 are much easier to deal with if you have threads and good
synchronization. Java makes this trivial.
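
For points 2 and 3, the core idea is to hand the document to the parser in
bounded pieces instead of buffering it whole. The following is a minimal
single-threaded sketch of that idea, assuming a standard C++ stream as the
source. The function name stream_in_chunks and the callback signature are
hypothetical, and a threaded design would additionally need a synchronized
queue between reader and parser, which is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <sstream>  // only needed for in-memory usage such as std::istringstream
#include <string>
#include <vector>

// Read `in` in pieces of at most `chunk_size` bytes and pass each piece to
// `sink`. Only one chunk is held in memory at a time. Returns the total
// number of bytes delivered to the sink.
std::size_t stream_in_chunks(
    std::istream& in,
    std::size_t chunk_size,
    const std::function<void(const char*, std::size_t)>& sink) {
    assert(chunk_size > 0);
    std::vector<char> buffer(chunk_size);
    std::size_t total = 0;
    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = in.gcount();
        if (got <= 0) {
            break;  // nothing more to deliver
        }
        sink(buffer.data(), static_cast<std::size_t>(got));
        total += static_cast<std::size_t>(got);
    }
    return total;
}
```

With an in-memory stream such as std::istringstream and a chunk size of 4,
a 10-byte input reaches the sink in three calls (4, 4, and 2 bytes), so
memory use is bounded by the chunk size rather than the document size.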

-- 
Andrew Scherpbier <andrew@contigo.com>
Contigo Software <http://www.contigo.com/>


