Re: [htdig] htdig / Suse 6.2: very long run ?


Subject: Re: [htdig] htdig / Suse 6.2: very long run ?
From: Torsten Neuer (tneuer@inwise.de)
Date: Wed Apr 26 2000 - 08:55:14 PDT


Geoff Hutchison wrote:
>
> At 2:15 PM +0300 4/26/00, Peter L. Peres wrote:
> > It's me again ;-) Has anyone tried to index a C/java/C++/ASM source tree
> >using htdig ? Perhaps by placing a list of mnemonics and reserved words
> >in the bad word list ?

For C/C++/Java it should be quite easy to write a lex/yacc parser which
eliminates reserved words, operators and other "noise" characters. In
addition, such a parser could map globally declared functions and
variables to <H> tags.
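
To illustrate the idea, here is a rough sketch in Python rather than
lex/yacc. It is not a real parser: a few regular expressions stand in for
the grammar, the keyword list is a small assumed subset of C, and the
function name source_to_html is made up for this example. Function
definitions that start in column 0 become <h2> headings, and everything
else is reduced to its identifiers.

```python
import html
import re

# Assumption: a small subset of C reserved words, for illustration only.
C_KEYWORDS = {
    "auto", "break", "case", "char", "const", "continue", "default", "do",
    "double", "else", "enum", "extern", "float", "for", "goto", "if", "int",
    "long", "register", "return", "short", "signed", "sizeof", "static",
    "struct", "switch", "typedef", "union", "unsigned", "void", "volatile",
    "while",
}

# Naive heuristic: a line starting in column 0 that looks like
# "type name(" and contains no ';' is treated as a function definition.
FUNC_DEF = re.compile(r"^[A-Za-z_][\w\s\*]*?\b([A-Za-z_]\w*)\s*\([^;]*$")


def source_to_html(source: str) -> str:
    """Reduce C-like source to headings plus a paragraph of identifiers."""
    # Drop comments and string literals so they do not add noise words.
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.S)
    source = re.sub(r"//[^\n]*", " ", source)
    source = re.sub(r'"(?:\\.|[^"\\])*"', " ", source)

    headings = []
    words = []
    for line in source.splitlines():
        match = FUNC_DEF.match(line)
        if match and match.group(1) not in C_KEYWORDS:
            headings.append(match.group(1))
        # Keep identifiers only; operators, numbers and punctuation vanish.
        for ident in re.findall(r"[A-Za-z_]\w*", line):
            if ident not in C_KEYWORDS:
                words.append(ident)

    parts = ["<h2>" + html.escape(name) + "</h2>" for name in headings]
    parts.append("<p>" + html.escape(" ".join(words)) + "</p>")
    return "\n".join(parts)


if __name__ == "__main__":
    demo = "static int counter = 0;\n\nint add(int a, int b)\n{\n    return a + b;\n}\n"
    print(source_to_html(demo))
```

The output could be served by the web server or fed to the digger in
place of the raw source file. A proper lex/yacc grammar would be needed
to handle global variables, prototypes and multi-line declarations
reliably.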

There should be some source->html converters somewhere at freshmeat,
which already do some nice markup. Either plugging such a converter into
the web-server for converting plain source files on-the-fly or having
such a tool (perhaps with minor modifications) generate input for the
digger should be no problem.

> > Is there some support for parsing dvi and ps files ? dvi can be turned
> >into (ugly) text using dvi2ascii and there is a corresponding converter
> >for ps.
>
> I would check the conv_doc.pl script and plug in a dvi->txt
> converter. I believe it already handles PostScript files nicely.

Perhaps it is easier (and better, although slower) to convert dvi->ps
and use the PostScript feature of conv_doc.pl - dvi2ascii and similar
might lead to some unwanted effects with regard to embedded graphics,
which would probably cause a lot of noise in the document database (the
excerpts will contain lots of dashes/vertical bars etc. for ruled lines).
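
As a sketch of that two-step route, assuming dvips (from a TeX
installation) and ps2ascii (from Ghostscript) are on the PATH, something
like the following Python could do the conversion. The function names are
invented for this example, and this is not how conv_doc.pl itself is
wired up; it only shows the order of the steps.

```python
import subprocess
from pathlib import Path


def conversion_commands(dvi_path):
    """Return the two command lines: dvi -> ps, then ps -> plain text."""
    dvi = Path(dvi_path)
    ps = dvi.with_suffix(".ps")
    txt = dvi.with_suffix(".txt")
    return [
        ["dvips", "-o", str(ps), str(dvi)],  # DVI to PostScript
        ["ps2ascii", str(ps), str(txt)],     # PostScript to text
    ]


def dvi_to_text(dvi_path):
    """Run both steps and return the path of the resulting text file."""
    for command in conversion_commands(dvi_path):
        # check=True raises CalledProcessError if a tool fails.
        subprocess.run(command, check=True)
    return Path(dvi_path).with_suffix(".txt")
```

Calling dvi_to_text("paper.dvi") would leave paper.ps and paper.txt next
to the input file; the intermediate PostScript file could also be handed
to conv_doc.pl directly instead of running ps2ascii by hand.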

cheers,
  Torsten

-- 
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14                            Tel: +49-4101-403605
D-25474 Ellerbek                            Fax: +49-4101-403606
E-Mail: info@inwise.de            Internet: http://www.inwise.de



