Re: [htdig] PDF & ISO-Latin chars


Antti Rauramo (antti.rauramo@edita.fi)
Fri, 13 Aug 1999 14:31:27 +0300


Hello Gilles!

> On the other hand, pdftotext (part of the xpdf package) seems to handle

[...]

Yep, great, using xpdf's pdftotext helped! Now searching PDFs works
flawlessly too! Thank you!

> You may want to adapt the script to extract titles from PDFs using
> pdfinfo, if the titles matter to you. (That's something on my to-do
> list I can't seem to find the time for.)
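
A rough sketch of that pdfinfo idea, for illustration only. It assumes pdfinfo is on the PATH and prints a line of the form "Title:   some text"; the sub names and the "PDF Document" fallback are made up for this example and are not part of parse_doc.pl.

```perl
use strict;
use warnings;

# Pick the title out of pdfinfo-style output ("Title:   Some text").
# Returns undef when there is no Title line or the title is empty.
sub title_from_pdfinfo_output {
    my ($output) = @_;
    foreach my $line (split /\n/, $output) {
        if ($line =~ /^Title:\s*(.*?)\s*$/) {
            return length($1) ? $1 : undef;
        }
    }
    return undef;
}

# Run pdfinfo on a file and return its title, or undef on any failure.
sub pdf_title {
    my ($file) = @_;
    # The list form of a pipe open avoids passing the file name through a shell.
    open(my $fh, '-|', 'pdfinfo', $file) or return undef;
    my $output = do { local $/; <$fh> };
    close($fh);
    return defined $output ? title_from_pdfinfo_output($output) : undef;
}

# Hypothetical usage: perl pdftitle.pl some.pdf
if (@ARGV) {
    my $title = pdf_title($ARGV[0]);
    print "t\t", (defined $title ? $title : "PDF Document"), "\n";
}
```

Keeping the parsing separate from the pdfinfo call means the parsing can be checked against a canned string without any PDF at hand.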

Oops, heh, I hadn't read this far before I had already added a part to parse_doc.pl
that reads the PDF and finds the title. (Though this may have problems with encrypted PDFs.)
Here's the excerpt, beginning around line 152...

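#############################################
# print out the title
@temp = split(/\//, $ARGV[2]); # split on "/"; $temp[-1] is the file name, used in the fallback below
#print "t\t$type Document $temp[-1]\n"; # old behaviour: generic title only

### 13-08-1999 ant
# Scan the raw PDF for an entry such as "/Title (Some title)".
$pdftitle = "";
if (open(TITLEIN, "<$ARGV[0]")) {
  binmode(TITLEIN);
  while (<TITLEIN>) {
    next unless /title/i;
    ($pdftitle) = /\/Title \(([^\/)]+)[\/\)]/i;
    last if $pdftitle; # stop reading at the first title found
  }
  close(TITLEIN);
} else {
  print STDERR "$ARGV[0]: $!\n";
}

if ($pdftitle) {
  # decode octal escapes such as \344 into ISO-Latin characters
  $pdftitle =~ s/\\([0-7]{3})/chr(oct($1))/ge;
} else {
  $pdftitle = "$type Document $temp[-1]"; # fall back to the generic title
}
print "t\t$pdftitle\n";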
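
The pattern above stops at the first ")" or "/", so a title containing either gets cut short. Below is a minimal sketch of a more tolerant extraction, assuming the title is stored as an unencrypted literal string, where "\(" and "\)" are escaped parentheses and "\ddd" is an octal character code. The sub names and the "PDF Document" fallback are hypothetical; encrypted files and titles stored as hex strings are not handled.

```perl
use strict;
use warnings;

# Single-letter escapes allowed inside a PDF literal string.
my %SIMPLE = (n => "\n", r => "\r", t => "\t", b => "\b", f => "\f");

# Turn the raw contents of a literal string into plain bytes.
sub decode_pdf_string {
    my ($raw) = @_;
    $raw =~ s{\\([0-7]{1,3}|.)}{
        my $e = $1;
        $e =~ /^[0-7]/       ? chr(oct($e) & 0xFF)   # octal escape, e.g. \344
      : $e eq "\n"           ? ''                    # backslash-newline: line continuation
      : exists $SIMPLE{$e}   ? $SIMPLE{$e}
      :                        $e                    # \( \) \\ and anything else
    }gse;
    return $raw;
}

# Find the first "/Title (...)" and return its decoded text, or undef.
sub extract_title {
    my ($data) = @_;
    return undef unless $data =~ /\/Title\s*\(/g;   # leaves pos() just after "("
    my ($depth, $raw) = (1, '');
    while ($data =~ /\G(\\.|[^\\()]+|[()])/gcs) {
        my $tok = $1;
        if    ($tok eq '(') { $depth++ }
        elsif ($tok eq ')') { last if --$depth == 0 }   # matching close found
        $raw .= $tok;
    }
    return $depth == 0 && length($raw) ? decode_pdf_string($raw) : undef;
}

# Hypothetical usage: perl pdftitle.pl some.pdf
if (@ARGV) {
    open(my $fh, '<', $ARGV[0]) or die "$ARGV[0]: $!\n";
    binmode($fh);
    my $data = do { local $/; <$fh> };
    close($fh);
    my $title = extract_title($data);
    print "t\t", (defined $title ? $title : "PDF Document"), "\n";
}
```

Reading the whole file at once keeps the sketch short; for large PDFs a line-by-line scan like the one in the excerpt would use less memory.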

--
- Antti Rauramo, WWW and database specialist, Edita Verkkoviestintä
- antti.rauramo@edita.fi, +358-9-8501 4004 (mobile)



