Re: [htdig] different search results


Subject: Re: [htdig] different search results
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Wed Nov 15 2000 - 14:08:12 PST


According to gkalter:
> Hope this mailing-list is the right one..;-)
>
> Today I got htdig to work pretty well on a site containing many
> PDF-Files.
>
> Cobalt Raq2 microserver (mips) with RedHat based Linux
>
> After updating the C++ Compiler (see mailing list) I got rid of the
> segmentation error messages and htdig worked well.
>
> Cryptic outputs of the search form were solved by adding a ".cgi"
> extension to htsearch
> in the local cgi-bin folder. Solution also found in the list - thanks to
> all those helpful people!

I think the FAQ also has some pointers on getting the CGI to work.

> Because I wanted to get direct links to single PDF Pages out of the
> found excerpts I got
> the pdftodig.py script for external parsing of PDF-Files. (Do I have to
> mention that python
> IS NOT installed on Cobalt Raqs?) O.K. this problem could also be
> solved.

It would also be a fairly trivial change to either of the Perl scripts,
conv_doc.pl or doc2html.pl, to make it replace the form feeds in pdftotext
output with the correct HTML <a name="..."> tags for the anchors. You'd
then be using an external converter, rather than an external parser, and
possibly avoiding parser-related problems.
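
As a rough illustration of the form-feed idea, here is a minimal sketch in
Python rather than Perl. It is not taken from conv_doc.pl or doc2html.pl;
the function name and the "pageN" anchor naming are assumptions made up
for this example, and it assumes pdftotext separates pages with form feed
characters.

```python
import html

def pages_to_html(text, anchor_prefix="page"):
    """Turn pdftotext output into HTML with one named anchor per page.

    The "pageN" anchor naming is hypothetical; use whatever names your
    result links are supposed to point at.
    """
    pages = text.split("\f")
    # pdftotext typically emits a form feed after every page, including
    # the last one, which leaves an empty trailing chunk; drop it.
    if pages and not pages[-1].strip():
        pages.pop()
    parts = ["<html><body>"]
    for number, page in enumerate(pages, start=1):
        parts.append('<a name="%s%d"></a>' % (anchor_prefix, number))
        # Escape &, < and > so the page text cannot break the markup.
        parts.append("<pre>%s</pre>" % html.escape(page))
    parts.append("</body></html>")
    return "\n".join(parts)
```

The result is ordinary HTML, so it would go through the normal HTML
parsing rather than through a separate external parser.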

> Now everything works pretty good with one little exception.
>
> Using a complete search string e.g. "Sensor" lists all matching
> documents and the text contains
> the search word (bold typeface) with a link to the specific single Page
> of the found PDF file.
> (Great!)
>
> Typing just a substring e.g. "Senso" in the search form seems to list
> same results. But unfortunately the links within
> the texts are gone.

Sounds like one of two problems:

1) the maximum_word_length setting is too low, so you're getting truncated
words in the database causing false matches which aren't found in the
excerpt.

2) the pdftodig.py script is somehow truncating the words for the word
records, or otherwise generating word records that don't match the words
in the header record it puts out. Try running it manually on one of the
PDFs where you had problems with false matches, and see what it puts out
both in the "h" record and in the "w" records, to see if there are any
discrepancies.
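
To make that manual comparison less tedious, something along these lines
could scan the parser's output for you. It is only a sketch: it assumes
the records are tab-separated, with the record type ("h" or "w") in the
first field and the excerpt text or the word in the second, so check that
against the external_parsers documentation and against what your copy of
pdftodig.py really prints. The function name is made up for the example.

```python
import re

def find_truncated(lines):
    """List "w" words that are missing from the "h" text but are a
    prefix of some word that does appear there.

    Assumes tab-separated records: field 0 is the record type,
    field 1 is the word ("w") or the excerpt text ("h").
    """
    head_words = set()
    indexed = set()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2:
            continue
        if fields[0] == "h":
            head_words.update(w.lower() for w in re.findall(r"\w+", fields[1]))
        elif fields[0] == "w":
            indexed.add(fields[1].lower())
    suspects = []
    for word in sorted(indexed - head_words):
        # Only flag words that look like a cut-off form of a real word,
        # since the "h" text may not cover the whole document.
        longer = sorted(h for h in head_words if h.startswith(word))
        if longer:
            suspects.append((word, longer))
    return suspects
```

Feed it the lines the script prints for one of the problem PDFs. An
entry such as "senso" paired with "sensor" would point at the word
records being cut short.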

Generally, entering just a substring in the search form isn't enough
to get a match, unless you're using the prefix or substring fuzzy match
algorithms. However, the fuzzy match algorithms generate an expanded list
of matches so that all matched words should be highlighted. It seems
to me that somehow you're getting substrings in your word database,
which is the cause of the problem.
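
A toy model of that reasoning, in Python, may make the difference
clearer. This is not ht://Dig's code, and the function names and sample
data are invented for the illustration. It only contrasts a prefix-style
expansion, which yields whole words that can be highlighted, with a stray
substring record in the word list, which matches the query but has no
whole-word counterpart in the excerpt.

```python
import re

def exact_lookup(index, term):
    """Words in the index that equal the search term."""
    return [w for w in index if w == term.lower()]

def prefix_expand(index, term):
    """Words in the index that start with the search term."""
    return [w for w in index if w.startswith(term.lower())]

def highlight_hits(excerpt, words):
    """Whole words of the excerpt that match one of the given words."""
    wanted = set(words)
    return [tok for tok in re.findall(r"\w+", excerpt) if tok.lower() in wanted]

excerpt = "The Sensor output is linear"
good_index = ["the", "sensor", "output", "is", "linear"]
bad_index = good_index + ["senso"]  # stray substring record (made up)

# Clean index, no fuzzy matching: the substring finds nothing.
no_match = exact_lookup(good_index, "Senso")
# Clean index, prefix expansion: the full word is found and highlighted.
expanded = highlight_hits(excerpt, prefix_expand(good_index, "Senso"))
# Index with a substring record: the document matches, but there is
# no whole word in the excerpt to highlight.
stray = exact_lookup(bad_index, "Senso")
stray_highlight = highlight_hits(excerpt, stray)
```

In the last case the document is still listed, because "senso" is in the
word list, yet the excerpt has nothing to mark up.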

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
