[htdig] Re: doc_parser.pl

Subject: [htdig] Re: doc_parser.pl
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Aug 30 2000 - 09:21:20 PDT

According to Benjelloun Adnane:
> to make doc_parser.pl to work with accents please change this line :
> push @allwords, grep { length >= $minimum_word_length } split /\W+/;
> to :
> push @allwords, grep { length >= $minimum_word_length } split
> /[^a-zA-Z]+/;

Or much better still, dump the old external parser, and switch to an
external converter like conv_doc.pl or doc2html.pl. There's no reason to
support parse_doc.pl any longer. It's been hacked too many times by too
many users with too many conflicting needs, and never did give results
that are consistent with the internal parsers. An external converter
will, because it defers the parsing to the internal parsers.
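
For illustration, a converter is normally registered through the
external_parsers attribute in the ht://Dig configuration file. The fragment
below is only a sketch: it assumes an ht://Dig release that accepts the
"type->type" converter form, and the install path /usr/local/bin/conv_doc.pl
and the two MIME types are placeholders, not taken from this message.

```
# htdig.conf fragment (sketch; the path and MIME types are assumptions)
# "source-type->target-type" marks the program as a converter, so htdig
# parses its output with the internal parser for the target type.
external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
                  application/pdf->text/html /usr/local/bin/conv_doc.pl
```

With an entry of this form, the converter's HTML output goes to the internal
HTML parser, so word splitting should follow the same rules as for ordinary
HTML pages, and no hand-edited split pattern is needed in the script.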

