Re: htdig: External document parser??

Michael J. Long (
Tue, 30 Sep 1997 14:40:45 -0400

Markus wrote:


> > > so what about the approach to put all parsers into a dynamically
> > > loadable library?
> >
> > > What if each module were a separate dynamic library that
> > htdig would load at startup? Sort of like how Netscape deals
> > with plug-ins. For those of you not familiar, NS searches
> > a directory path and queries all plug-ins in those paths for
> > the MIME type that they handle. NS then registers the plug-in
> > to handle any file that matches that MIME type.
> >
> > This would work perfectly for htdig. That way, instead of
> > having to alter the source code (Document::getParsable) every
> > time a new module is created, htdig could get this information
> > dynamically from the "plug-ins" themselves.
> >
> > > I will be more than happy to help add this functionality.
> >
> > > this library would have to export a certain set of functions
> > > (eg getTitle(file) ). if the requested type of information can
> > > be extracted by the parser, the library delivers it.
> >
> > Sounds like Markus is describing the exact same thing.
> >
> > > one would have to supply an additional configuration directive
> > > to associate a file type with a library like mime types.
> >
> > Well, maybe he is thinking similar but not the same. :^)
> your approach seems more comfortable but what if someone wants
> to have eg phtml file treated like html or has the server

Remember, htdig acts as a web client and determines the parser
to use according to MIME type, not extension. If the server
claims .phtml files are of the type "text/html", then that
is what htdig uses.
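
To make the point concrete, here is a minimal sketch of choosing a parser
from the server's Content-Type header while ignoring the URL extension.
The names mimeTypeOf and parserFor and the parser labels are hypothetical,
made up for illustration; this is not htdig's actual code.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not part of htdig): reduce a Content-Type value such
// as "Text/HTML; charset=iso-8859-1" to the bare lowercase type "text/html".
std::string mimeTypeOf(const std::string& contentType) {
    // Drop any parameters after the first ';'.
    std::string type = contentType.substr(0, contentType.find(';'));

    // Trim leading and trailing whitespace.
    auto notSpace = [](unsigned char c) { return !std::isspace(c); };
    type.erase(type.begin(), std::find_if(type.begin(), type.end(), notSpace));
    type.erase(std::find_if(type.rbegin(), type.rend(), notSpace).base(),
               type.end());

    // MIME type names are case-insensitive, so normalise to lowercase.
    std::transform(type.begin(), type.end(), type.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return type;
}

// Hypothetical selection routine: the URL is accepted but deliberately
// unused, so "page.phtml" served as text/html gets the HTML parser.
std::string parserFor(const std::string& url, const std::string& contentType) {
    (void)url;  // the extension plays no part in the decision
    const std::string type = mimeTypeOf(contentType);
    if (type == "text/html") return "html";
    if (type == "text/plain") return "plaintext";
    return "";  // no parser known for this type
}
```

With this, parserFor("http://example.com/page.phtml", "text/html") and
parserFor("http://example.com/page.html", "text/html") select the same
parser, because only the reported MIME type is consulted.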

> configured to treat any html file as phtml and wants only
> '*.pure' files parsed by htdig? i think there is a need to

Couldn't you just add .html and .phtml to the exclude_urls
configuration option in htdig.conf?

> associate file types with parser libs in a convenient way.

If you want to put it in a configuration file, that is another
possibility. But shouldn't the plug-in know what MIME types it
can handle? What if the user tries to use the text
module to parse a Frame (or Word) file? The index would be
filled with garbage, wouldn't it? I'm not sure, I'm guessing.
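
As a rough sketch of what I mean by the plug-in knowing its own types:
each plug-in reports the MIME types it handles, and the host registers it
under exactly those types. Every name here (ParserPlugin, PluginRegistry,
makeDefaultRegistry) is hypothetical and not taken from htdig. The
plug-ins are compiled in to keep the example self-contained; actually
loading them from shared libraries at startup is left out.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical plug-in interface: a plug-in declares which MIME types it
// can parse, so the host never has to be told separately.
class ParserPlugin {
public:
    virtual ~ParserPlugin() = default;
    virtual std::vector<std::string> mimeTypes() const = 0;
    virtual std::string name() const = 0;
};

class PlainTextPlugin : public ParserPlugin {
public:
    std::vector<std::string> mimeTypes() const override { return {"text/plain"}; }
    std::string name() const override { return "plaintext"; }
};

class HtmlPlugin : public ParserPlugin {
public:
    std::vector<std::string> mimeTypes() const override { return {"text/html"}; }
    std::string name() const override { return "html"; }
};

// Hypothetical registry: maps each advertised MIME type to its plug-in.
class PluginRegistry {
public:
    // Ask the plug-in what it handles and register it for each type.
    void add(std::shared_ptr<ParserPlugin> plugin) {
        for (const std::string& type : plugin->mimeTypes()) {
            byType_[type] = plugin;
        }
    }

    // Returns nullptr when no plug-in claimed the type, so a document of an
    // unknown type is skipped instead of being fed to the wrong parser.
    const ParserPlugin* find(const std::string& mimeType) const {
        auto it = byType_.find(mimeType);
        return it == byType_.end() ? nullptr : it->second.get();
    }

private:
    std::map<std::string, std::shared_ptr<ParserPlugin>> byType_;
};

// Stand-in for the startup scan of a plug-in directory.
PluginRegistry makeDefaultRegistry() {
    PluginRegistry registry;
    registry.add(std::make_shared<PlainTextPlugin>());
    registry.add(std::make_shared<HtmlPlugin>());
    return registry;
}
```

Under this arrangement a Word or Frame document has no registered
plug-in, so find() returns nullptr and the text module never sees it. A
configuration file could still override the mapping, but the default
would come from the plug-ins themselves.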

> > > the most generous approach would be using the nss interface of
> > > glibc2 that would make it a service available to any software
> > > on that machine.
> >
> > Unfortunately, I am not familiar with nss. Could anyone guide
> > me to some documentation/information??
> that would be really really great because it would solve the
> configuration problem by reading /etc/nsswitch.conf and one could
> have something like 'htgrep' to use at the command line.
> one could even think of implementing more skillful infosystems
> at the command line or wherever by just calling a function from
> libnsssomeservice which will be available by only linking against
> libc6. using nss would give us a great deal in 'shell integration'.
> a very useful feature would be that information generated by
> htdig could be easily made available to any web relevant scripting
> language.
> documentation is available in the texinfo files that come
> with the libc6 distribution.
> the latest version is glibc 2.0.5 it is available from

I will have to look at the documentation before commenting but
does libc6 replace libc? And if it does, would that mean libc
would have to be replaced on the system to compile and run
htdig? If it does, then I don't think that will fly.


> what is a frame file?

Adobe FrameMaker, a desktop publishing tool. FrameMaker was
originally created by Frame Technology, which was bought by
Adobe. A competing product is Interleaf.

[...sig snipped...]

Michael J. Long

* Michael J. Long * #include <disclaimer.h>
*   Summa Four    * Work: mjlong@Summa4.COM
* Manchester, NH  * Play:
