Re: [htdig] PDF indexing problem


Subject: Re: [htdig] PDF indexing problem
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Mon Nov 29 1999 - 11:19:45 PST


According to J. op den Brouw:
> I'm using Acroread 3.02 for HP-UX to index .pdf files. It seems to
> work alright, but when htmerge starts, a lot of words seem to be
> "glued" together.
...
> htmerge: Merging...
> htmerge: 100:abilit
> htmerge: 200:alsaresho
> htmerge: 300:andmak
> htmerge: 400:arenotjoin
> htmerge: 500:atfunction

I ran into the same problem myself. I added a fix for this back in 3.1.2,
but it only handles some of the situations that cause it, not all. Some
applications do wacky things with character spacing to achieve word
spacing, and that leads to some really ugly PostScript output from acroread,
which PDF.cc just can't make sense of. You'd probably have better
luck using pdftotext, from the xpdf 0.90 package, in an external parser.

See http://www.htdig.org/FAQ.html#q4.9 for details. Wasn't this parse_doc.pl
script originally yours?
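
To give a rough idea of what an external parser built around pdftotext
involves, here is a minimal sketch in Python. It is not the parse_doc.pl
script from the FAQ, just an illustration. It assumes that pdftotext is on
the PATH and writes to standard output when given "-" as the output file,
that htdig passes the file to parse as the first argument, and that word
records take the tab-separated form "w", word, location (0-1000), heading
level. The names text_to_records, pdf_to_text and MIN_WORD_LENGTH are made
up for this example, so check the record format against the external_parsers
documentation for your htdig version before relying on it.

```python
import re
import subprocess
import sys

WORD_RE = re.compile(r"[A-Za-z0-9]+")
MIN_WORD_LENGTH = 3  # assumption: stands in for htdig's minimum word length


def text_to_records(text):
    """Turn plain text into tab-separated word records (assumed format)."""
    words = [w.lower() for w in WORD_RE.findall(text)
             if len(w) >= MIN_WORD_LENGTH]
    total = len(words)
    records = []
    for index, word in enumerate(words):
        # Relative position of the word in the document, scaled to 0-1000.
        location = index * 1000 // total
        # Heading level 0 marks the word as plain body text.
        records.append("w\t%s\t%d\t0" % (word, location))
    return records


def pdf_to_text(pdf_path):
    """Run pdftotext on pdf_path and return the extracted text."""
    result = subprocess.run(["pdftotext", pdf_path, "-"],
                            stdout=subprocess.PIPE, check=True)
    # Latin-1 is an assumption about the output encoding.
    return result.stdout.decode("latin-1", errors="replace")


def main(argv):
    # argv[1] is assumed to be the path of the file to parse.
    for record in text_to_records(pdf_to_text(argv[1])):
        print(record)
    return 0


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

A script like this would be hooked up through the external_parsers attribute
in the htdig configuration file, pairing the application/pdf content type
with the path to the script. Since pdftotext works from the text in the PDF
rather than from acroread's PostScript output, it should be less prone to
gluing words together.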

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



