Re: htdig: External document parser??

Michael J. Long
Tue, 30 Sep 1997 09:22:20 -0400

Markus wrote:
> if i got the problem right,
> you want htdig to be able to parse e.g. gif files and extract the
> relevant information to store in its database?
> and you want this done by a somehow separate module?

Yep. Looking at the source code, this is what is done now.
There is a separate module for text and another for postscript.
For the GIF files, a module could index the comments in the
GIF. Of course, I know very few people that actually use that
feature of GIF.

> that would be really great!!!!!!

I know. A very intelligent move by the htdig folks.

> so what about the approach to put all parsers into a dynamically
> loadable library?

What if each module were a separate dynamic library that
htdig would load at startup? Sort of like how Netscape deals
with plug-ins. For those of you not familiar, NS searches
a directory path and queries all plug-ins in those paths for
the MIME type that they handle. NS then registers the plug-in
to handle any file that matches that MIME type.

This would work perfectly for htdig. That way, instead of
having to alter the source code (Document::getParsable) every
time a new module is created, htdig could get this information
dynamically from the "plug-ins" themselves.
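To make the idea concrete, here is a minimal sketch of such a registry
in modern C++. It is not htdig code: ParserPlugin, ParserRegistry and
extractText are names invented for illustration, and actually loading
the libraries from a directory (e.g. with dlopen) is left out.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical descriptor: what a module reports about itself when queried.
struct ParserPlugin {
    std::string name;                    // e.g. "postscript"
    std::vector<std::string> mimeTypes;  // MIME types the module claims to handle
    std::function<std::string(const std::string&)> extractText;  // raw data -> plain text
};

// Hypothetical registry that replaces a hard-coded dispatch on document type.
class ParserRegistry {
public:
    // Register the plug-in under every MIME type it claims.
    // A later registration for the same type replaces the earlier one.
    void add(const ParserPlugin& plugin) {
        assert(plugin.extractText && "plug-in must supply a parse function");
        for (const std::string& type : plugin.mimeTypes)
            byType_[type] = plugin;
    }

    // Return the plug-in registered for a MIME type, or nullptr if there is none.
    const ParserPlugin* find(const std::string& mimeType) const {
        auto it = byType_.find(mimeType);
        return it == byType_.end() ? nullptr : &it->second;
    }

private:
    std::map<std::string, ParserPlugin> byType_;
};
```

With something like this, adding a new format would only mean dropping
in a module that registers itself; the lookup code would stay the same.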

I will be more than happy to help add this functionality.

> this library would have to export a certain set of functions
> (eg getTitle(file) ). if the requested type of information can
> be extracted by the parser, the library delivers it.

Sounds like Markus is describing the exact same thing.
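One way such an exported function set might look, as a rough sketch
only. DocumentParser, getText and FirstLineTitleParser are made-up
names (only getTitle(file) comes from Markus's message), and
std::optional is used here to express "this parser cannot deliver
that kind of information".

```cpp
#include <fstream>
#include <optional>
#include <string>

// Hypothetical interface that every parser library would implement.
// An accessor returns std::nullopt when the format cannot supply that item.
class DocumentParser {
public:
    virtual ~DocumentParser() = default;
    virtual std::optional<std::string> getTitle(const std::string& /*file*/) const {
        return std::nullopt;
    }
    virtual std::optional<std::string> getText(const std::string& /*file*/) const {
        return std::nullopt;
    }
};

// Illustrative parser: treats the first line of a plain-text file as its title.
// It does not override getText, so callers see that text is unavailable.
class FirstLineTitleParser : public DocumentParser {
public:
    std::optional<std::string> getTitle(const std::string& file) const override {
        std::ifstream in(file);
        std::string firstLine;
        if (in && std::getline(in, firstLine))
            return firstLine;
        return std::nullopt;  // unreadable or empty file: nothing to deliver
    }
};
```

The caller asks for what it wants and simply skips whatever comes
back empty, so a GIF module could answer a comment query and nothing
else.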

> one would have to supply an additional configuration directive
> to associate a file type with a library like mime types.

Well, maybe he is thinking of something similar but not the same. :^)
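For comparison, the configuration-driven variant could be sketched
like this. The directive name "parser_library:" and its two-field
layout are assumptions made up for this example, not an existing
htdig attribute, and the paths are placeholders.

```cpp
#include <map>
#include <sstream>
#include <string>

// Parse lines of the (hypothetical) form
//   parser_library: <mime-type> <library-path>
// and return a MIME type -> library path map.
// Lines that do not match, including comments and incomplete entries, are ignored.
std::map<std::string, std::string> parseParserDirectives(const std::string& config) {
    std::map<std::string, std::string> libraries;
    std::istringstream lines(config);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string key, mimeType, path;
        if (fields >> key >> mimeType >> path && key == "parser_library:")
            libraries[mimeType] = path;
    }
    return libraries;
}
```

The difference from the plug-in approach is who owns the mapping:
here the administrator states it in the config file, whereas with
plug-ins each module reports its own MIME types.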

> the most generous approach would be using the nss interface of
> glibc2 that would make it a service available to any software
> on that machine.

Unfortunately, I am not familiar with nss. Could anyone guide
me to some documentation/information??

> everybody in the world would have to supply such a library for
> his file formats :) that would cause adobe a lot of work.

If life were perfect, I would agree, but we all know what life
is, or rather is not.

Cause Adobe a lot of work?? I doubt it. They most likely already
have a PDF (or PS, or Frame, or Pagemaker, etc.) to text converter
written. It would just be a matter of passing that text to the
text parser that is already included with htdig.

While I am thinking of it, I am thinking about writing a module
for Frame files and I need a little (theoretical) help. What I
was thinking of doing was calling an external program (fmbatch)
to convert the Frame file to text and then have the text parser
parse the text file. That way, I can avoid the work of having to
decode the Frame file and I also wouldn't have to write a text
parser all over again. Is this possible??
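In outline, I imagine something like the sketch below. It assumes a
POSIX system (popen) and a converter that writes plain text to its
standard output; the real fmbatch invocation is not shown because I
have not worked it out, and countWords is only a stand-in for
htdig's text parser.

```cpp
#include <stdio.h>
#include <sstream>
#include <stdexcept>
#include <string>

// Run a shell command and return everything it writes to standard output.
// Uses POSIX popen(); throws if the command cannot be started.
std::string captureOutput(const std::string& command) {
    FILE* pipe = popen(command.c_str(), "r");
    if (!pipe)
        throw std::runtime_error("cannot run: " + command);
    std::string output;
    char buffer[4096];
    size_t n;
    while ((n = fread(buffer, 1, sizeof buffer, pipe)) > 0)
        output.append(buffer, n);
    pclose(pipe);
    return output;
}

// Stand-in for the existing text parser: counts whitespace-separated words.
size_t countWords(const std::string& text) {
    std::istringstream in(text);
    std::string word;
    size_t count = 0;
    while (in >> word)
        ++count;
    return count;
}

// Convert with the external program, then hand the result to the text parser.
size_t parseViaConverter(const std::string& converterCommand) {
    return countWords(captureOutput(converterCommand));
}
```

The new module would then be little more than the command line for
the converter; all real parsing stays in the text parser.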


Michael J. Long

P.S. Thanks for such a great tool. All we need for it is a
     programmer's reference.

* Michael J. Long * #include <disclaimer.h>
*   Summa Four    * Work: mjlong@Summa4.COM
* Manchester, NH  * Play:
