Subject: Re: [htdig] parse_doc.pl split word with accents
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Mon May 29 2000 - 08:38:29 PDT
According to Andoni Ayala:
> When I try to parse documents (PDF, WordPerfect, etc.) with
> parse_doc.pl, the script splits accented words in two. But if I parse
> the document directly with the particular parser (e.g. wp2html or
> pdftohtml), the accents display correctly.
Are you sure it's the parse_doc.pl script, and not htdig, that's splitting
the words? Do you have your locale set correctly? See
http://www.htdig.org/FAQ.html#q4.9
http://www.htdig.org/FAQ.html#q4.10
http://www.htdig.org/FAQ.html#q5.8
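
To illustrate why the locale matters here, the following is a minimal Python sketch, not htdig's actual code. It assumes an indexer that breaks text wherever it meets a character its locale does not count as a letter. The names split_words, C_LOCALE_LETTERS and LATIN1_LETTERS are hypothetical, and the Latin-1 set is only a small hand-picked stand-in for a real locale's character classes.

```python
import string

# Under the plain "C" locale, only ASCII letters and digits count as
# word characters (assumption for this sketch).
C_LOCALE_LETTERS = set(string.ascii_letters + string.digits)

# Hypothetical stand-in for a Latin-1 locale: the same set plus a few
# accented letters. A real locale covers many more characters.
LATIN1_LETTERS = C_LOCALE_LETTERS | set("áéíóúüñÁÉÍÓÚÜÑ")


def split_words(text, word_chars):
    """Split text into words, breaking at any character not in word_chars."""
    words, current = [], []
    for ch in text:
        if ch in word_chars:
            current.append(ch)
        elif current:
            # A non-word character ends the word collected so far.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


sample = "análisis técnico"
print(split_words(sample, C_LOCALE_LETTERS))  # accented letters act as separators
print(split_words(sample, LATIN1_LETTERS))    # accented letters stay inside the word
```

With the ASCII-only set, each accented letter is treated as a separator, so "análisis" comes out as "an" and "lisis". That would look the same as the converter splitting the word, even if the converter's output was fine.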
You should probably also use an external converter, such as conv_doc.pl or
better yet, doc2html, as you'll get better results than with parse_doc.pl.
The doc2html converter also makes it easier to add other conversion
programs.
-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930