Re: htdig: Digging MS Word 97


J. op den Brouw (MSQL_User@st.hhs.nl)
Thu, 12 Nov 1998 14:54:29 +0100


Here is a Perl script that uses the catdoc program (V 0.90).
Download catdoc from the URL given in the script below, unpack the
tarball, run ./configure, and so on. Then set $CATDOC to the path of
the catdoc binary.

Set external_parsers to something like:

external_parsers: application/msword /usr/local/htdig/external_parsers/bin/parse_word_doc.pl

And it should run. Note that catdoc is a beta release and sometimes fails
to parse a Word doc. This Perl script takes a long time on large Word
files....

--jesse

#!/usr/local/gnu/bin/perl

#########################################
#
# set this to your catdoc proggie
#
# get it from: http://www.fe.msk.ru/~vitus/catdoc/
#
$CATDOC = "/usr/local/htdig/external_parsers/catdoc/bin/catdoc";

# need some var's
#empty array
@allwords = ();
$x = 0;
$line = "";
@fields = ();
$calc = 0;

#
# okay. my programming style isn't that nice, but it works...

#for ($x=0; $x<@ARGV; $x++) {
# print STDERR "$ARGV[$x]\n";
#}

open(CAT, "$CATDOC -a -w $ARGV[0] |") || die "Hmmm. Something is wrong.\n";
while ($line = <CAT>) {
        @fields = split(/\s+/,$line);
        for ($x=0; $x<@fields; $x++) {
                if ($fields[$x] =~ /\w/) {
                        # push appends in place instead of copying the whole array
                        push(@allwords, $fields[$x]);
                }
        }
}

close CAT;

#############################################
# print out the title
print "t\tWord Document $ARGV[2]\n";

#############################################
# print out the head
$calc = @allwords;
print "h\t";
#if ($calc >100) { # but not more than 100 words
# $calc = 100;
#}
for ($x=0; $x<$calc; $x++) {
        print "$allwords[$x] ";
}
print "\n"; # end the head record so the first word record starts on its own line

#############################################
# now the words
for ($x=0; $x<@allwords; $x++) {
        # calculate rel. position (0-1000)
        $calc=int(1000*$x/@allwords);
        # print out word, rel. pos. and text type (0)
        print "w\t$allwords[$x]\t$calc\t0\n";
}
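
For anyone who wants to try the record format without catdoc installed,
here is a minimal sketch of the same idea in more current Perl. It reads
from an in-memory string instead of a catdoc pipe. The names
collect_words and make_records and the example URL are made up for
illustration; the t/h/w record layout is simply the one the script above
prints, and the third argument is assumed to be the document URL.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect whitespace-separated tokens that contain at least one word
# character, reading from any filehandle (hypothetical helper).
sub collect_words {
    my ($fh) = @_;
    my @words;
    while (my $line = <$fh>) {
        push @words, grep { /\w/ } split /\s+/, $line;
    }
    return @words;
}

# Build the title, head and word records in the layout used by the
# script above (hypothetical helper).
sub make_records {
    my ($url, @words) = @_;
    my @out;
    push @out, "t\tWord Document $url\n";
    push @out, "h\t" . join(' ', @words) . "\n";
    my $n = @words;
    for my $i (0 .. $#words) {
        # relative position of the word, scaled to 0-1000
        my $pos = int(1000 * $i / $n);
        push @out, "w\t$words[$i]\t$pos\t0\n";
    }
    return @out;
}

# Demo input standing in for catdoc output (assumption: plain text lines).
my $text = "Hello  Word\n -- world\n";
open(my $fh, '<', \$text) or die "cannot open in-memory text: $!\n";
my @words = collect_words($fh);
close $fh;

print make_records('http://example.com/a.doc', @words);
```

With a real catdoc, the handle could instead come from the list form of
a pipe open, e.g. open(my $cat, '-|', $CATDOC, '-a', '-w', $ARGV[0]),
which keeps the file name away from the shell. That form needs a newer
Perl than the one this script was written for.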

BLKA.DEZ54 wrote:
>
> Can anyone give me an example of how to embed a Word viewer into htdig
> (htparsedoc or something else)?
>
> Thanks
> ----------------------------------------------------------------------
> To unsubscribe from the htdig mailing list, send a message to
> htdig-request@sdsu.edu containing the single word "unsubscribe" in
> the body of the message.



This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:28:47 PST