Subject: Re: [htdig3-dev] external parsers
From: Geoff Hutchison (email@example.com)
Date: Wed Nov 24 1999 - 13:55:24 PST
Phew! I need to type faster. As I wrote a response to Tom, Gilles sent this!
At 12:00 PM -0600 11/22/99, Gilles Detillieux wrote:
>Now I'm a bit fuzzy on the history, because all of this happened over
>a year ago when I came on the scene, but I believe that external parser
I came on as maintainer about 18 months ago. I *do* know that I added
the PDF.cc parser, which was contributed about the same time I came
on. I just looked things up using CVSweb:
These indicate that the ExternalParser class dates to 1997.
So why did I add Sylvain's PDF parser when there was external_parser
support? It worked and as Gilles said, it's often difficult to write
a complete external parser! So since it was provided, I thought we
should go with it. At the time, it seemed like a good idea. Hindsight
>Yes, that's correct. Most document types other than HTML are
>currently dealt with just as plain text, ultimately, so all structural
>information is lost. The only exception to this is the latest version
>of parse_doc.pl, which has a hook in it to extract the title from PDFs
For example, there's a PDF library in Perl that supposedly lets you
grab various meta-information. However, no one has written an
external_parser that uses it. Even if someone did, I don't know how
useful it would be, since in general such information is only sparsely used.
>Yes, if someone could add a good, efficient and reliable XML parser to
>htdig, that would certainly be the way to go.
Such parsers exist. Hopefully someone has a good idea about
how to use one!
>Yeah, but I think we'd need to get to the bottom of why exactly htdig
>is too slow. I don't think the current HTML parser is necessarily the
>model of efficiency either, so it may be that a well designed XML parser
>in its place wouldn't slow things down too much. I think attention really
>needs to be paid to the database back-end, and minimizing the amount of
>copying of huge strings that takes place in the current code.
We may want to start splitting into different threads. When we talk
about speed, we should be careful about which component we're talking
about. First off, Tom, where did you hear it wasn't fast enough?
As far as the indexer goes, I'd guess the main slowdown comes in database
operations. String optimizations wouldn't hurt, but database lookups
kill us, especially on large databases. But careful profiling and
optimization on 3.2 still needs to be done.
> > Can htdig's config parsing handle multiple directives with the same
> > name? (If I recall correctly it only remembers the last one seen.) I
> > was just thinking that it might be cleaner to specify items like this
> > using multiple directives like this:
> > external_parser: text/html /usr/local/bin/htmlparser
> > external_parser: application/ms-word "mswordparser -w"
>The problem is you want to be able to override attributes that were
>defined previously, in an include file for example. I'd favour a
>different syntax for appending to an already defined attribute, e.g.:
> bad_extensions += .pdf
Right. Plus this sort of change will be a bit easier to do now that
the guts of the config parser were rewritten by Vadim in bison/flex.
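
To make the two behaviours concrete, here is a rough sketch (in Python, not
the actual bison/flex grammar) of how a parser could treat a plain
"name: value" line as an override and a "name += value" line as an append.
The "+=" syntax is only Gilles's proposal, and the function name
parse_config and the space-joined append are assumptions made for
illustration, not anything in htdig today.

```python
def parse_config(lines):
    """Parse 'name: value' (override) and 'name += value' (append) lines.

    Hypothetical sketch: the '+=' form is a proposed syntax, and joining
    appended values with a single space is an assumption.
    """
    attrs = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments

        # Try the append form first. Requiring that no ':' appears before
        # the '+=' keeps values that merely contain '+=' from being misread.
        name, sep, value = line.partition("+=")
        if sep and ":" not in name:
            name = name.strip()
            previous = attrs.get(name, "")
            attrs[name] = (previous + " " + value.strip()).strip()
            continue

        # Plain form: a later definition replaces an earlier one, so a
        # main config file can override what an include file set.
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError("not an attribute line: %r" % line)
        attrs[name.strip()] = value.strip()
    return attrs


if __name__ == "__main__":
    config = parse_config([
        "bad_extensions: .gif .jpg",
        "bad_extensions += .pdf",   # appended to the earlier value
        "max_doc_size: 100000",
        "max_doc_size: 200000",     # last definition wins
    ])
    for key, val in sorted(config.items()):
        print(key, "=", val)
```

With this approach a repeated "external_parser:" line would still replace
the earlier one, which keeps overriding from include files working, while
"+=" gives an explicit way to extend a list.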