Re: htdig: External document parser??


Markus (markus@dom.de)
Tue, 30 Sep 1997 17:33:30 +0100


----------
> From: Michael J. Long <mjlong@Summa4.COM>
> To: Markus <markus@dom.de>
> Cc: htdig@sdsu.edu
> Subject: Re: htdig: External document parser??
> Date: Tuesday, 30 September 1997 14:22
>
> Markus wrote:
> >
> > if i got the problem right:
> >
> > you want htdig to be able to parse e.g. gif files and extract the
> > relevant information to store in its database?
> > and you want this done by a somehow separate module?
>
> Yep. Looking at the source code, this is what is done now.
> There is a separate module for text and another for postscript.
> For the GIF files, a module could index the comments in the
> GIF. Of course, I know very few people that actually use that
> feature of GIF.
>
> > that would be really great!!!!!!
>
> I know. A very intelligent move by the htdig folks.
>
> > so what about the approach of putting all parsers into a dynamically
> > loadable library?
>
> What if each module would be a separate dynamic library that
> htdig would load at startup? Sort of like how Netscape deals
> with plug-ins. For those of you not familiar, NS searches
> a directory path and queries all plug-ins in those paths for
> the MIME type that they handle. NS then registers the plug-in
> to handle any file that matches that MIME type.
>
> This would work perfectly for htdig. That way, instead of
> having to alter the source code (Document::getParsable) every
> time a new module is created, htdig could get this information
> dynamically from the "plug-ins" themselves.
>
> I will be more than happy to help adding this functionality.
>
> > this library would have to export a certain set of functions
> > (e.g. getTitle(file) ). if the requested type of information can
> > be extracted by the parser, the library delivers it.
>
> Sounds like Markus is describing the exact same thing.
>
> > one would have to supply an additional configuration directive
> > to associate a file type with a library like mime types.
>
> Well, maybe he is thinking similar but not the same. :^)
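
A minimal C++ sketch of the plug-in idea discussed above: each parser reports the MIME types it handles, and a registry maps each type to a parser. The names ParserPlugin, PlainTextPlugin, PluginRegistry and mimeTypes are hypothetical and are not htdig's actual API; only getTitle comes from this thread. Loading the plug-ins from shared libraries is left out.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical interface that every parser plug-in would implement.
class ParserPlugin {
public:
    virtual ~ParserPlugin() {}
    // MIME types this plug-in claims to handle.
    virtual std::vector<std::string> mimeTypes() const = 0;
    // Extract a title from the raw document data.
    virtual std::string getTitle(const std::string &data) const = 0;
};

// Toy plug-in: treats the first line of a text file as its title.
class PlainTextPlugin : public ParserPlugin {
public:
    std::vector<std::string> mimeTypes() const override {
        return {"text/plain"};
    }
    std::string getTitle(const std::string &data) const override {
        return data.substr(0, data.find('\n'));
    }
};

// Registry filled at startup by querying each plug-in for its MIME types.
class PluginRegistry {
public:
    void registerPlugin(const ParserPlugin *plugin) {
        assert(plugin != nullptr);
        for (const std::string &type : plugin->mimeTypes())
            handlers_[type] = plugin;
    }
    // Returns the plug-in for a MIME type, or nullptr if none is registered.
    const ParserPlugin *find(const std::string &mimeType) const {
        auto it = handlers_.find(mimeType);
        return it == handlers_.end() ? nullptr : it->second;
    }
private:
    std::map<std::string, const ParserPlugin *> handlers_;
};
```

With this shape, adding a new format means registering one more plug-in rather than editing a central switch such as Document::getParsable.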

your approach seems more comfortable, but what if someone wants to have e.g.
phtml files treated like html, or has the server configured to treat any html
file as phtml and wants only '*.pure' files parsed by htdig?
i think there is a need to associate file types with parser libs in a
convenient way.
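
A rough C++ sketch of such an association table, assuming a hypothetical configuration format of one "<extension> <mime-type>" pair per line with '#' starting a comment. Neither the format nor the function names parseAssociations and typeForFile exist in htdig; they are only illustrations.

```cpp
#include <map>
#include <sstream>
#include <string>

// Parse lines of the hypothetical form "<extension> <mime-type>".
// Text after '#' is ignored; blank or incomplete lines are skipped.
std::map<std::string, std::string> parseAssociations(const std::string &config) {
    std::map<std::string, std::string> table;
    std::istringstream lines(config);
    std::string line;
    while (std::getline(lines, line)) {
        std::string::size_type hash = line.find('#');
        if (hash != std::string::npos)
            line.erase(hash);
        std::istringstream fields(line);
        std::string extension, mimeType;
        if (fields >> extension >> mimeType)
            table[extension] = mimeType;
    }
    return table;
}

// Look up the configured type for a file name by its extension.
// Returns an empty string when the extension has no association.
std::string typeForFile(const std::map<std::string, std::string> &table,
                        const std::string &filename) {
    std::string::size_type dot = filename.rfind('.');
    if (dot == std::string::npos)
        return "";
    auto it = table.find(filename.substr(dot + 1));
    return it == table.end() ? "" : it->second;
}
```

A table containing only "pure text/html" would make '*.pure' files go to the html parser while plain '*.html' files stay unassociated, which is the second case above.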

>
> > the most general approach would be using the nss interface of
> > glibc2 that would make it a service available to any software
> > on that machine.
>
> Unfortunately, I am not familiar with nss. Could anyone guide
> me to some documentation/information??

that would be really really great because it would solve the configuration
problem by reading /etc/nsswitch.conf and one could have something like
'htgrep' to use at the command line.
one could even think of implementing more sophisticated info systems at the
command line or wherever by just calling a function from libnsssomeservice
which will be available by only linking against libc6.
using nss would give us a great deal of 'shell integration'. a very useful
feature would be that information generated by htdig could easily be made
available to any web-relevant scripting language.

documentation is available in the texinfo files that come with the libc6
distribution.
the latest version is glibc 2.0.5; it is available from
ftp://alpha.gnu.ai.mit.edu/gnu/libc

>
> > everybody in the world would have to supply such a library for
> > his file formats :) that would cause adobe a lot of work.
>
> If life were perfect, I would agree, but we all know what life
> is, or rather is not.
>
> Cause Adobe a lot of work?? I doubt it. They most likely already
> have a PDF (or PS, or Frame, or Pagemaker, etc.) to text converter
> written. It would just be a matter of passing that text to the
> text parser that is already included with htdig.
>
> While I am thinking of it, I am thinking about writing a module
> for Frame files and I need a little (theoretical) help. What I
> was thinking of doing was calling an external program (fmbatch)
> to convert the Frame file to text and then have the text parser
> parse the text file. That way, I can avoid the work of having to
> decode the Frame file and I also wouldn't have to write a text
> parser all over again. Is this possible??
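
One possible shape for the converter idea in the quoted question, as a hedged C++ sketch: run an external program, capture its standard output through POSIX popen, and hand the text to a text parser. The function names runConverter and parseText are hypothetical, parseText is only a whitespace splitter standing in for htdig's real text parser, and the actual fmbatch command line is not given in this thread, so the caller has to supply it.

```cpp
#include <stdio.h>   // popen, pclose, fread (POSIX)
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Run an external converter and return everything it writes to stdout.
// Throws if the command cannot be started or exits with an error.
std::string runConverter(const std::string &command) {
    FILE *pipe = popen(command.c_str(), "r");
    if (!pipe)
        throw std::runtime_error("cannot start converter");
    std::string text;
    char buffer[4096];
    size_t n;
    while ((n = fread(buffer, 1, sizeof buffer, pipe)) > 0)
        text.append(buffer, n);
    if (pclose(pipe) != 0)
        throw std::runtime_error("converter failed");
    return text;
}

// Stand-in for the existing text parser: split the text into words.
std::vector<std::string> parseText(const std::string &text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string word;
    while (in >> word)
        words.push_back(word);
    return words;
}
```

Chaining the two as parseText(runConverter(command)) keeps the format-specific work in the external program, so no second text parser has to be written.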

what is a frame file?

markus

>
> [...snip...]
>
> Michael J. Long
>
> P.S. Thanks for such a great tool. All we need for it is a
> programmer's reference.
>
> --
> * Michael J. Long * #include <disclaimer.h>
> * Summa Four * Work: mjlong@Summa4.COM
> * Manchester, NH * Play: mjlong@mindspring.com



This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:25:05 PST