Re: [htdig] Newbie question regarding htdig


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Wed, 12 May 1999 15:28:13 -0500 (CDT)


According to Sunil Jagarlamudi:
> I am using xpdf and pdftotext to index the files, and I have noticed that
> htdig ignores the numbers in it's db.wordlist and also when I try to search
> for something like 078-001, it comes back with nothing found for
> 078001. I tried to do the search in quotes as well, but to no avail. Is there
> anyway we can let htdig keep an index of the numbers in the files as well and
> also numbers with - and _ with them ?

If you're using parse_doc.pl as an external parser for application/pdf files
(it uses pdftotext to convert the PDF files), you'll need to remove or
comment out this line (about line # 138):

        s/[\-\255]/ /g; # replace hyphens with space

Then, as Geoff suggested, you'll need to set the allow_numbers and
extra_word_characters attributes in your htdig.conf file. You'll need to
reindex after that. The extra_word_characters attribute was introduced
in 3.1.2.
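
As a rough sketch only, the two attributes might look like this in htdig.conf.
The values shown are assumptions for the hyphenated part numbers in question,
not settings taken from a working installation. If hyphens still get stripped,
the valid_punctuation attribute may also need a look, since by default it
treats the hyphen as punctuation to be removed from words.

```
# Assumed values, adjust to your own setup:
allow_numbers:          true
extra_word_characters:  -
```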

Underscore characters shouldn't pose a problem in any case.

The hyphens within numbers could pose a further problem for you if, in
the PDF files, there are line breaks at the hyphens, like this: 078-
001. There's code in parse_doc.pl to dehyphenate words, but not to
rejoin hyphenated numbers. You'd need to modify the code to handle
these, if it turns out to be a problem for you. Of course, for numbers,
you'd need to keep the hyphen in when you rejoin them.
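
Something along these lines might do for rejoining numbers, as a minimal
sketch rather than a tested patch. The rejoin_hyphenated_numbers helper is
hypothetical and is not part of parse_doc.pl. It also assumes the text is
available as one string that still contains the line breaks, which may not
match how parse_doc.pl actually reads the pdftotext output. It would have
to run before any step that replaces hyphens with spaces.

```perl
use strict;
use warnings;

# Hypothetical helper (not in parse_doc.pl): rejoin a number that was
# broken across lines at a hyphen, keeping the hyphen in the result.
sub rejoin_hyphenated_numbers {
    my ($text) = @_;
    # Match: digit, hyphen, optional trailing blanks, a line break,
    # optional leading blanks, followed by another digit.
    # Only the line break and blanks are dropped; the hyphen stays.
    $text =~ s/(\d)-[ \t]*\r?\n[ \t]*(?=\d)/$1-/g;
    return $text;
}

my $sample = "Part 078-\n001 is listed; see also co-\noperate.\n";
print rejoin_hyphenated_numbers($sample);
```

Ordinary words broken at a hyphen are left alone here, so the existing
dehyphenation code for words would still be needed.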

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



This archive was generated by hypermail 2.0b3 on Wed May 12 1999 - 13:38:31 PDT