Re: [htdig] PDFs, numbers, and percent signs


Subject: Re: [htdig] PDFs, numbers, and percent signs
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Jan 10 2001 - 13:08:37 PST


According to Philip E. Varner:
> 1) The directive minimum_word_length defaults to 3, but when dealing with
> two-digit numbers, this should be set to two. The default would catch
> "25%", but not other numbers. This needs to be set in htdig.conf, AND in
> parse_doc.pl, if using it. parse_doc.pl should probably be changed to
> read variables from htdig.conf at some point in time, but that's not my
> call.
>
> 2) In addition to minimum_word_length, I added these attributes to
> htdig.conf:
>
> allow_numbers: true
> extra_word_characters: %$&#
> valid_punctuation: .-_/!^'
>
> By default, htdig ignores numbers, so I set it to count them. It also
> ignores most punctuation, so I allow the characters %$&# since they are
> common pre/suffixes for numbers. valid_punctuation then lists the
> punctuation that is still ignored. These settings also need to be
> accounted for in parse_doc.pl.
>
> 3) The default for parse_doc.pl is to strip all punctuation, with the
> command
>
> tr{-\255._/!#$%^&'}{}d;
>
> I changed this to
>
> tr{-\255._/!^'}{}d;
>
> to leave the punctuation I wanted. However, this punctuation was still
> deleted because of the way the text is split() into an array. I changed
> the command
>
> push @allwords, grep { length >= $minimum_word_length } split/\W+/;
>
> to
>
> push @allwords, grep { length >= $minimum_word_length } split /\s+/;
>
> \W matches any character that's not a word character (letter, digit, or
> underscore), which includes punctuation. So, punctuation was still
> getting stripped out. \s matches all whitespace, which is what I really
> want, since all "offending" punctuation was removed earlier. This works
> for me, but might not work for everyone.
>
> 4) I increased the limits on these two attributes, since PDFs are larger,
> I only had a few dozen, and I wanted good matches. This is probably not a
> good idea if you have a lot of files, though.
>
> max_head_length: 500000
> max_doc_size: 50000000
>
>
> If anyone has any other suggestions, I'd like to hear about them.
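
To make the effect of the change in point 3 concrete, here is a minimal
standalone sketch of the two steps (delete punctuation, then split into
words). It is not taken from parse_doc.pl itself: the sample line and the
variable names $line, $clean, @by_nonword and @by_space are made up for
illustration. Only the tr and split expressions come from the message above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text; not from any real document.
my $line = 'Part-time pay rose 25% to $40.';
my $minimum_word_length = 2;

# Step 1: delete the punctuation that should vanish from words,
# leaving % $ & # alone (the modified tr from point 3).
(my $clean = $line) =~ tr{-\255._/!^'}{}d;

# Step 2a: original split. Any run of non-word characters is a
# separator, so % and $ are thrown away here.
my @by_nonword = grep { length($_) >= $minimum_word_length }
                 split /\W+/, $clean;

# Step 2b: modified split. Only whitespace separates words, so % and $
# stay attached to the numbers.
my @by_space = grep { length($_) >= $minimum_word_length }
               split /\s+/, $clean;

print "cleaned:      $clean\n";         # Parttime pay rose 25% to $40
print "split /\\W+/: @by_nonword\n";    # Parttime pay rose 25 to 40
print "split /\\s+/: @by_space\n";      # Parttime pay rose 25% to $40
```

With split /\s+/, any punctuation that the tr step does not delete ends up
inside the indexed words, so the two steps have to be kept in agreement.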

Most of the problems you ran into could have easily been avoided if you
tossed parse_doc.pl into the bit bucket and used an external converter
like doc2html.pl or conv_doc.pl instead.

As you realised, external parsers don't read your config file attributes,
and it would mean making them extremely big and complicated, with a lot
of duplication of code, to get them to do this properly. That's why
external parsers, in most cases, are a bad idea. That's also why I
added external converter support back in version 3.1.4. That way, you
just need a simple conversion to plain text or HTML, and all the gory
details of parsing the document in accordance with the user's wishes are
handled internally by the text or HTML parser.
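
As a rough sketch, an external converter is declared in htdig.conf with
the "input-type->output-type" form of the external_parsers attribute. The
install path below is an assumption; adjust it to wherever the script
lives on your system.

```
# Hand PDFs to conv_doc.pl and let htdig's internal HTML parser index
# the result. /usr/local/bin is an assumed location.
external_parsers: application/pdf->text/html /usr/local/bin/conv_doc.pl
```

With this in place, attributes such as minimum_word_length, allow_numbers,
extra_word_characters and valid_punctuation are applied by htdig itself,
so nothing needs to be duplicated in the script.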

So, no, parse_doc.pl should not be changed to read the htdig.conf
attributes. It should be given a decent burial and forgotten.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930




This archive was generated by hypermail 2b28 : Wed Jan 10 2001 - 13:22:30 PST