Re: [htdig3-dev] Current Status as of snapshot 3.2.0b1-020600


Subject: Re: [htdig3-dev] Current Status as of snapshot 3.2.0b1-020600
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Feb 09 2000 - 13:54:44 PST


According to Gilles Detillieux:
> According to Geoff Hutchison:
> > Are there any other attributes which are (or should) be defunct in
> > the next release? Since word_list isn't used in the code, I think it
> > should go, unless we want to change word_dump to word_list (they have
> > remarkably similar function). I'm also wondering if we shouldn't kill
> > pdf_parser?
>
> I'd be in favor of killing pdf_parser, but it would be nice to have
> an acro2text utility to convert acroread's PS output to plain text, as
> PDF.cc did, so that if anyone is still set on using acroread, they'll
> be able to do so through an external converter.

OK, I think this Perl script will do the trick. I invite any Perl
hackers out there to hack away at it. I don't think I'll bother with
a C implementation unless this script just won't do. As far as I'm
concerned, we can probably now kill off the pdf_parser attribute, and
all the headaches associated with it. I just hope the script won't be
the source of more headaches! :-P

#!/usr/local/bin/perl
#
# Sample external converter for htdig 3.1.4 or later, to convert PDFs using
# Adobe Acrobat 3's acroread -toPostScript option on UNIX systems.
# (Use it in place of conv_doc.pl if you have acroread but not pdftotext.)
#
# Usage: (in htdig.conf)
#
# external_parsers: application/pdf->text/html /usr/local/bin/acroconv.pl
#
# This is a pretty quick and dirty implementation, but it does seem to
# give functionality equivalent to the soon-to-be eliminated htdig/PDF.cc
# parser. I'm not a Perl expert by any stretch of the imagination, so the
# code could probably use a lot of optimization to make it work better.
#

$watch = 0;
$bigspace = 0;
$putspace = 0;
$putbody = 1;

# List-form system() keeps the shell from interpreting the file name.
system("ln", $ARGV[0], "$ARGV[0].pdf");
system("acroread", "-toPostScript", "$ARGV[0].pdf");
open(INP, "< $ARGV[0].ps") || die "Can't open $ARGV[0].ps\n";

print "<HTML>\n<head>\n";
while (<INP>) {
        if (/^%%Title: / && $putbody) {
                s/^%%Title: \((.*)\).*\n/$1/;
                s/\\222/'/g;
                s/\\267/*/g;
                s/\\336/fi/g;
                s/\\([0-7]{1,3})/pack("C", oct($1))/eg;
                s/\\[nrtbf]/ /g;
                s/\\(.)/$1/g;
                s/&/\&amp\;/g;
                s/</\&lt\;/g;
                s/>/\&gt\;/g;
                print "<title>$_</title>\n";
                print "</head>\n<body>\n";
                $putbody = 0;
        } elsif (/^BT/) {
                $watch = 1;
        } elsif (/^ET/) {
                $watch = 0;
                if ($putspace) {
                        print "\n";
                        $putspace = 0;
                }
        } elsif ($watch) {
                if (/T[Jj]$/) {
                        s/\)[^(]*\(//g;
                        s/^[^(]*\((.*)\).*\n/$1/;
                        s/\\222/'/g;
                        s/\\267/*/g;
                        s/\\336/fi/g;
                        s/\\([0-7]{1,3})/pack("C", oct($1))/eg;
                        s/\\[nrtbf]/ /g;
                        s/\\(.)/$1/g;
                        if ($bigspace) {
                                s/(.)/$1 /g;
                        }
                        s/&/\&amp\;/g;
                        s/</\&lt\;/g;
                        s/>/\&gt\;/g;
                        if ($putbody) {
                                print "</head>\n<body>\n";
                                $putbody = 0;
                        }
                        print "$_";
                        $putspace = 1;
                } elsif (/T[Ddm*]$/ && $putspace) {
                        print "\n";
                        $putspace = 0;
                } elsif (/Tc$/) {
                        $bigspace = 0;
                        if (/^([3-9]|[1-9][0-9]+)\..*Tc$/) {
                                $bigspace = 1;
                        }
                }
        }
}
if ($putbody) {
        print "</head>\n<body>\n";
}
print "</body>\n</HTML>\n";
close(INP);
unlink("$ARGV[0].pdf", "$ARGV[0].ps");
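
The title and body branches above run the same chain of substitutions,
so one way to tidy things up would be to pull that chain into a
subroutine. What follows is only a rough sketch under that assumption:
the names ps_to_html and show_text are made up for illustration and do
not appear in the script, and the sample TJ line is invented rather
than taken from real acroread output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: decode the escapes in one PostScript string body
# and make the result safe to emit as HTML.
sub ps_to_html {
    my ($s) = @_;
    $s =~ s/\\222/'/g;                             # \222 -> apostrophe
    $s =~ s/\\267/*/g;                             # \267 -> asterisk (bullet)
    $s =~ s/\\336/fi/g;                            # \336 -> "fi" ligature
    $s =~ s/\\([0-7]{1,3})/pack("C", oct($1))/eg;  # any other octal escape
    $s =~ s/\\[nrtbf]/ /g;                         # control escapes -> space
    $s =~ s/\\(.)/$1/g;                            # \( \) \\ -> literal char
    $s =~ s/&/&amp;/g;                             # entities: & must go first
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Hypothetical helper: return the text shown by one Tj/TJ operator line,
# or undef if the line is not a text-showing operator.
sub show_text {
    my ($line) = @_;
    chomp $line;
    return undef unless $line =~ /T[Jj]$/;
    $line =~ s/\)[^(]*\(//g;                       # (Hel)-20(lo) -> (Hello)
    return undef unless $line =~ s/^[^(]*\((.*)\).*$/$1/;
    return ps_to_html($line);
}

# Invented sample line, for illustration only.
my $html = show_text("[(Hel)-20(lo \\(w\\) & co)] TJ\n");
print "$html\n";
```

Run on its own, this should print "Hello (w) &amp; co". Like the script
itself, the string-joining step can still be fooled by an escaped
parenthesis that is followed later on the line by another string, so
treat it as a starting point rather than a complete parser.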

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



