Subject: Re: [htdig] PDFs, numbers, and percent signs
From: Philip E. Varner (pev5b@cs.virginia.edu)
Date: Wed Jan 10 2001 - 11:09:20 PST


I figured out the problem and solution. In case anyone else has this
problem in the future, here are a few of the gotchas.

A Description of My Original Problem

We have a bunch of PDF files that are the minutes from committee meetings.
The members want to be able to search them, so htdig was the natural
choice. One person remembered something about "25%" from a meeting, but
was unsure of which one, so they tried the search engine. But, it didn't
return any results. So, I poked around and found it was a combination of
things. Here's what they are.

Solutions:

Below are things I did in addition to the directions on htdig.org for
indexing PDF files.

1) The directive minimum_word_length defaults to 3, but when dealing with
two-digit numbers it should be set to 2. The default would still catch
"25%" (three characters), but not bare two-digit numbers. This needs to be
set in htdig.conf AND in parse_doc.pl, if you are using it. parse_doc.pl
should probably be changed to read its settings from htdig.conf at some
point, but that's not my call.
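For example, here is a minimal sketch of the two settings that have to
agree. The htdig.conf line is the documented directive; exactly where the
value lives in parse_doc.pl is an assumption on my part, but
$minimum_word_length is the variable the script already uses when
filtering words:

# htdig.conf
minimum_word_length: 2

# parse_doc.pl, wherever the defaults are defined (placement assumed)
$minimum_word_length = 2;    # must match the value in htdig.conf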

2) In addition to minimum_word_length, I added these attributes to
htdig.conf:

allow_numbers: true
extra_word_characters: %$&#
valid_punctuation: .-_/!^'

By default, htdig ignores numbers, so I set it to count them. It also
ignores most punctuation, so I allow the characters %$&# since they are
common prefixes and suffixes for numbers. valid_punctuation then says
which characters to ignore. These also need to be accounted for in
parse_doc.pl, which is what item 3 below covers.

3) The default for parse_doc.pl is to strip all punctuation, with the
command

tr{-\255._/!#$%^&'}{}d;

I changed this to

tr{-\255._/!^'}{}d;

to leave the punctuation I wanted. However, this punctuation was still
being deleted because of the way the text is split() into an array. I
changed the command

push @allwords, grep { length >= $minimum_word_length } split/\W+/;

to

push @allwords, grep { length >= $minimum_word_length } split /\s+/;

\W matches any non-word character, which includes punctuation. So
splitting on it was still stripping the punctuation out. \s matches only
whitespace, which is what I really want, since all the "offending"
punctuation was already removed earlier by the tr. This works for me, but
might not work for everyone.
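If it helps, here is a tiny standalone Perl illustration (just a sketch,
not part of parse_doc.pl) of why the pattern matters for a token like
"25%":

use strict;
use warnings;

my $line = "approved a 25% increase in dues";

# \W+ treats the "%" itself as a delimiter, so the suffix is lost.
my @on_nonword = split /\W+/, $line;
print "@on_nonword\n";    # approved a 25 increase in dues

# \s+ splits only on whitespace, so "25%" survives as one word.
my @on_space = split /\s+/, $line;
print "@on_space\n";      # approved a 25% increase in dues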

4) I increased the limits on these two attributes, since PDFs are large, I
only had a few dozen of them, and I wanted good matches. This is probably
not a good idea if you have a lot of files, though.

max_head_length: 500000
max_doc_size: 50000000
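
Putting it all together, the htdig.conf additions from items 1, 2, and 4
look like this (the values are simply what worked for my setup):

# htdig.conf additions for indexing the PDF minutes
minimum_word_length: 2
allow_numbers: true
extra_word_characters: %$&#
valid_punctuation: .-_/!^'
max_head_length: 500000
max_doc_size: 50000000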

If anyone has any other suggestions, I'd like to hear about them.

Phil Varner

--
A distributed system is one in which the failure of a computer you
didn't even know existed can render your own computer unusable.
-- Leslie Lamport
