Re: [htdig] PDF indexing problem


Subject: Re: [htdig] PDF indexing problem
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Aug 09 2000 - 13:53:32 PDT


According to Justin Hopkins:
> What exactly has been unsatisfactory about acroread? I'm curious to
> hear your (and anyone else's) experience to decide if I'll switch over
> to doc2html.

My biggest beef, apart from the fact that it was slow, was that the
character spacing tended to throw off word separation, so that some words
ended up concatenated or, more often, single words were broken up by a
space. I had patched PDF.cc to correct the more flagrant cases, but quite
a few still slipped through, and they weren't easy to deal with.
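
To illustrate why this kind of problem is hard to patch away, here is a
minimal sketch of a gap-based word splitter. It is not the logic in
PDF.cc or acroread; the glyph tuples, the join_glyphs name and the
threshold values are all assumptions made up for the example. It only
shows that no single spacing threshold suits every line: too low and
words get broken up, too high and they get concatenated.

```python
from typing import List, Tuple

# Hypothetical glyph record: (character, x_start, x_end) in page units.
Glyph = Tuple[str, float, float]


def join_glyphs(glyphs: List[Glyph], gap_threshold: float) -> str:
    """Rebuild text, inserting a space wherever the horizontal gap
    between two consecutive glyphs exceeds gap_threshold."""
    out = []
    prev_end = None
    for ch, x_start, x_end in glyphs:
        if prev_end is not None and x_start - prev_end > gap_threshold:
            out.append(" ")
        out.append(ch)
        prev_end = x_end
    return "".join(out)


# "to be" set with a narrow word gap (1.5 units) and a loosely spaced
# letter pair inside "be" (1.2 units). Values are invented.
glyphs = [("t", 0.0, 1.0), ("o", 1.1, 2.1), ("b", 3.6, 4.6), ("e", 5.8, 6.8)]

print(join_glyphs(glyphs, 1.0))  # threshold too low: a word gets split
print(join_glyphs(glyphs, 2.0))  # threshold too high: words run together
print(join_glyphs(glyphs, 1.3))  # happens to work for this line only
```

A threshold that works for one font or line can fail on the next, which
is roughly why the flagrant cases could be patched but the rest kept
slipping through.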

I also have a strong preference for external converters (doc2html or
conv_doc) over external parsers (e.g. parse_doc), because with a converter
the parsing itself is done by htdig's internal parsers, so it's consistent
with them and uses the config attributes you set for punctuation and such.
parse_doc tended to handle many characters differently, so it got patched
several times over by several users to fix various problems, and it still
wasn't quite right.
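
For reference, a converter is hooked up through the external_parsers
attribute in htdig.conf, using the "->" form to name the content type the
script produces. The fragment below is only a sketch: the install path is
an assumption, so adjust it to wherever the script lives on your system.

```
# htdig.conf fragment; the script path below is an assumed location
external_parsers: application/pdf->text/html /usr/local/bin/doc2html.pl
```

Because the script emits text/html, htdig then runs its own HTML parser
over the result, which is where the consistency comes from.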

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



