[htdig] Re: doc_parser.pl


Subject: [htdig] Re: doc_parser.pl
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Aug 30 2000 - 09:21:20 PDT


According to Benjelloun Adnane:
> to make doc_parser.pl to work with accents please change this line :
>
> push @allwords, grep { length >= $minimum_word_length } split /\W+/;
>
> to :
>
> push @allwords, grep { length >= $minimum_word_length } split /[^a-zA-Z]+/;

Or much better still, dump the old external parser, and switch to an
external converter like conv_doc.pl or doc2html.pl. There's no reason to
support parse_doc.pl any longer. It's been hacked too many times by too
many users with too many conflicting needs, and never did give results
that are consistent with the internal parsers. An external converter
will, because it defers the parsing to the internal parsers.
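
For anyone making the switch, here is a rough sketch of what the htdig.conf
entry might look like. It assumes a release recent enough to accept the
"type->type" converter syntax in external_parsers, and the install path
/usr/local/bin/conv_doc.pl is only a placeholder for wherever you put the
script. Check the comments at the top of conv_doc.pl or doc2html.pl for the
content types your copy actually handles.

```
# Hypothetical htdig.conf fragment: hand each document type to an
# external converter that emits HTML, which htdig then parses with
# its own internal HTML parser.
external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
                  application/postscript->text/html /usr/local/bin/conv_doc.pl \
                  application/pdf->text/html /usr/local/bin/conv_doc.pl
```

With an entry like this, word splitting and accent handling are left to the
internal parser, so they should match what you get for ordinary HTML pages.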

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

This archive was generated by hypermail 2b28 : Wed Aug 30 2000 - 09:22:33 PDT