Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 12 May 1999 15:28:13 -0500 (CDT)
According to Sunil Jagarlamudi:
> I am using xpdf and pdftotext to index the files, and I have noticed that
> htdig ignores the numbers in its db.wordlist and also when I try to search
> for something like 078-001, it comes back with nothing found for
> 078001. I tried to do the search in quotes as well, but to no avail. Is there
> any way we can let htdig keep an index of the numbers in the files as well and
> also numbers with - and _ with them ?
If you're using parse_doc.pl as an external parser for application/pdf files
(it uses pdftotext to convert the PDF files), you'll need to remove or
comment out this line (around line 138):

    s/[\-\255]/ /g; # replace hyphens with space
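
To illustrate why that line matters, here is a rough Python equivalent of the substitution (parse_doc.pl itself is Perl; the function name below is made up for this sketch). The Perl class [\-\255] matches an ASCII hyphen or the character with octal code 255, which is the Latin-1 soft hyphen (0xAD). With the line active, a number like 078-001 is split into two separate words before htdig ever sees it:

```python
import re

# Rough Python equivalent of the Perl character class [\-\255]:
# an ASCII hyphen or the Latin-1 soft hyphen (octal 255 = 0xAD).
# Assumes the text has already been decoded to a Python str.
_HYPHENS = re.compile("[-\xad]")


def replace_hyphens_with_space(text):
    """Mimic the effect of s/[\\-\\255]/ /g on a chunk of text."""
    return _HYPHENS.sub(" ", text)


if __name__ == "__main__":
    print(replace_hyphens_with_space("part 078-001"))  # part 078 001
```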
Then, as Geoff suggested, you'll need to set the allow_numbers and
extra_word_characters attributes in your htdig.conf file. You'll need to
reindex after that. The extra_word_characters attribute was introduced
in 3.1.2.
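
As a sketch, the two attributes might look like this in htdig.conf. The values shown are assumptions for this particular case (numbers allowed, hyphen treated as a word character); check the attribute documentation for your version before relying on them:

```
allow_numbers:          true
extra_word_characters:  -
```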
Underscore characters shouldn't pose a problem in any case.
The hyphens within numbers could pose a further problem for you if, in
the PDF files, there are line breaks at the hyphens, like this: 078-
001. There's code in parse_doc.pl to dehyphenate words, but not to
rejoin hyphenated numbers. You'd need to modify the code to handle
these, if it turns out to be a problem for you. Of course, for numbers,
you'd need to keep the hyphen in when you rejoin them.
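
The rejoining step could be sketched roughly as follows. This is written in Python purely as an illustration; it is not the code in parse_doc.pl, which is Perl, and the function name is hypothetical. It assumes a number is "broken" whenever a digit is followed by a hyphen, a line break, and another digit, and it keeps the hyphen when joining:

```python
import re

# A hyphen that sits between two digits and is followed by a line break
# (with optional spaces or tabs around the break). Lookarounds leave the
# digits themselves unconsumed, so chained breaks like 1-\n2-\n3 also work.
_BROKEN_NUMBER = re.compile(r"(?<=\d)-[ \t]*\r?\n[ \t]*(?=\d)")


def rejoin_hyphenated_numbers(text):
    """Rejoin numbers split at a hyphen across a line break, keeping the hyphen."""
    return _BROKEN_NUMBER.sub("-", text)


if __name__ == "__main__":
    sample = "See part 078-\n001 for details."
    print(rejoin_hyphenated_numbers(sample))  # See part 078-001 for details.
```

Hyphenated words (letters on either side of the break) are deliberately left alone here, since the existing dehyphenation code already deals with those and drops the hyphen.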
-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
This archive was generated by hypermail 2.0b3 on Wed May 12 1999 - 13:38:31 PDT