Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Tue, 9 Feb 1999 18:14:09 -0600 (CST)
According to me:
> I'd recommend not reinventing the wheel. Instead of a builtin parser,
> it would make a lot more sense to build an external parser around
> ghostscript. Its "ps2ascii" program, which is just a script that calls gs
> with specific options, would be a good starting point. You could modify a
> script like contrib/htparsedoc/parse_word_doc.pl to use ps2ascii instead
> of catdoc, and change the title it spits out. That would probably make
> a decent external PostScript parser. I haven't tried it, though.
OK, so I can't turn down a challenge. I had to poke around with the
parse_word_doc.pl script anyway, to test the catdoc problem Jesse had
reported, so I decided to enhance the script to handle PostScript files
too.
In the process, I found a small bug in ExternalParser.cc: it didn't remove
the temporary file it uses. Here's the patch for that:
--- ./htdig/ExternalParser.cc.noremove Mon Feb 1 13:46:23 1999
+++ ./htdig/ExternalParser.cc Tue Feb 9 17:56:18 1999
@@ -139,6 +139,7 @@
FILE *input = popen(command, "r");
if (!input)
{
+ unlink(path);
return;
}
@@ -335,6 +336,7 @@
}
}
pclose(input);
+ unlink(path);
}
And here is my new parse_word_or_ps_doc.pl script. OK, a shorter
name is in order. By the way, the original parse_word_doc.pl in
contrib/htparsedoc got messed up - all the long lines are folded, which
perl really didn't like! Be careful that your mail program doesn't do
the same to this one, if you're going to use it. As you can see, this
script could easily be extended to handle any number of "something"-to-text
converters, as long as the file command can determine what the
file type is.
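
To illustrate that last point, here is a minimal sketch (not part of the script below) of a table-driven version of the file-type check. The pick_converter function and the @converters table are hypothetical names, and the PDF row with pdftotext is an assumed extra converter that the script does not actually support. Anything the table doesn't recognize falls through to catdoc, as in the script.

```perl
use strict;
use warnings;

# Each row: [ pattern matched against file(1) output, type label, command builder ].
# The PostScript row mirrors the script; the PDF row is a hypothetical addition.
my @converters = (
    [ qr/:\s*PostScript/,   "PostScript",
      sub { "(cd /tmp; /usr/bin/ps2ascii; rm -f _temp_.???) < $_[0] |" } ],
    [ qr/:\s*PDF document/, "PDF",
      sub { "/usr/local/bin/pdftotext $_[0] - |" } ],
);

# Fallback when nothing matches, as in the script: assume a Word document.
my @default = ( "Word", sub { "/usr/local/bin/catdoc -a -w $_[0] |" } );

# Given one line of file(1) output and the document path,
# return the type label and the pipe command to open.
sub pick_converter {
    my ($file_output, $path) = @_;
    for my $c (@converters) {
        my ($pattern, $type, $build) = @$c;
        return ($type, $build->($path)) if $file_output =~ $pattern;
    }
    return ($default[0], $default[1]->($path));
}

my ($type, $parse) = pick_converter("/tmp/x.ps: PostScript document text", "/tmp/x.ps");
print "$type: $parse\n";
```

Adding another converter would then mean adding one row to the table rather than another branch to the if/else.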
--------------------- (snip) ---------------------
#!/usr/local/bin/perl
# 1998/12/10
# Added: push @allwords, $fields[$x]; <carl@dpiwe.tas.gov.au>
# Replaced: matching patterns. they match words starting or ending with ()[]'`;:?.,! now, not when in between!
# Gone: the variable $line is gone (using $_ now)
#
# 1998/12/11
# Added: catdoc test (is catdoc runnable?) <carl@dpiwe.tas.gov.au>
# Changed: push line semicolon wrong. <carl@dpiwe.tas.gov.au>
# Changed: matching works for end of lines now <carl@dpiwe.tas.gov.au>
# Added: option to rigorously delete all punctuation <carl@dpiwe.tas.gov.au>
# 1999/02/09
# Added: option to delete all hyphens <grdetil@scrc.umanitoba.ca>
# Changed: uses ps2ascii to handle PS files <grdetil@scrc.umanitoba.ca>
#########################################
#
# set this to your catdoc proggie
#
# get it from: http://www.fe.msk.ru/~vitus/catdoc/
#
$CATDOC = "/usr/local/bin/catdoc";
#
# set this to your PostScript to text converter
# get it from the ghostscript 3.33 (or later) package
#
$CATPS = "/usr/bin/ps2ascii";
# need some var's
@allwords = ();
@temp = ();
$x = 0;
@fields = ();
$calc = 0;
#
# okay. my programming style isn't that nice, but it works...
#for ($x=0; $x<@ARGV; $x++) { # print out the args
# print STDERR "$ARGV[$x]\n";
#}
open(FILE, "file $ARGV[0] |") || die "Hmmm. Can't determine file type.\n";
if (<FILE> =~ /:\s*PostScript/) {
    $parse = "(cd /tmp; $CATPS; rm -f _temp_.???) < $ARGV[0] |";
    $type = "PostScript";
    die "Hmm. ps2ascii is absent or unwilling to execute.\n" unless -x $CATPS;
} else {
    $parse = "$CATDOC -a -w $ARGV[0] |";
    $type = "Word";
    die "Hmm. catdoc is absent or unwilling to execute.\n" unless -x $CATDOC;
}
close FILE;
#
# open it
open(CAT, "$parse") || die "Hmmm. parser doesn't want to be opened using pipe.\n";
while (<CAT>) {
    s/\s[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]\s|^[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]$/ /g; # replace punctuation with space (only at end or begin of word)
#   s/[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]/ /g; # rigorously replace all by <carl@dpiwe.tas.gov.au>
    s/-/ /g; # replace hyphens with space
    @fields = split; # split up line
    next if (@fields == 0); # skip if no fields (does it speed up?)
    for ($x=0; $x<@fields; $x++) { # check each field if string length > 3
        if (length($fields[$x]) > 3) {
            push @allwords, $fields[$x]; # add to list
        }
    }
}
close CAT;
#############################################
# print out the title
@temp = split(/\//, $ARGV[2]); # split the path; the last component is the file name
print "t\t$type Document $temp[-1]\n"; # print it
#############################################
# print out the head
$calc = @allwords;
print "h\t";
#if ($calc >100) { # but not more than 100 words
# $calc = 100;
#}
for ($x=0; $x<$calc; $x++) { # print out the words for the excerpt
    print "$allwords[$x] ";
}
print "\n";
#############################################
# now the words
for ($x=0; $x<@allwords; $x++) {
    $calc=int(1000*$x/@allwords); # calculate rel. position (0-1000)
    print "w\t$allwords[$x]\t$calc\t0\n"; # print out word, rel. pos. and text type (0)
}
$calc=@allwords;
#print STDERR "# of words indexed: $calc\n";
--------------------- (snip) ---------------------
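
For reference, the script writes three kinds of tab-separated records to stdout: one "t" line with the title, one "h" line with the excerpt words, and one "w" line per word carrying its relative position (0-1000) and a text type of 0. Here is a small sketch of just that output step, with a hypothetical make_records helper and made-up sample words. Unlike the script, it joins the excerpt words without a trailing space.

```perl
use strict;
use warnings;

# Build the records for a list of words, in the same order the script
# prints them: title, head, then one line per word.
sub make_records {
    my ($title, @words) = @_;
    my @out = ("t\t$title", "h\t" . join(" ", @words));
    for my $i (0 .. $#words) {
        my $pos = int(1000 * $i / @words);   # relative position, 0-1000
        push @out, "w\t$words[$i]\t$pos\t0"; # text type is always 0 here
    }
    return @out;
}

# Sample title and words are invented for illustration.
print "$_\n" for make_records("PostScript Document sample.ps",
                              qw(external parser ghostscript catdoc));
```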
--
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930