Re: [htdig] acroread pdf parser includes "htdig9751.pdf" in PDF file search results


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Tue, 25 May 1999 12:09:40 -0500 (CDT)


According to Glenn Nielsen:
> I just started indexing PDF documents using acroread.
> All of the search results contain "htdig9751.pdf" at
> the start of the document excerpt in the search results.
> Looks like this is the name of some temp file that htdig
> uses when calling the pdf parser.
>
> Is there a way to configure htdig or acroread so
> that this is not included in the excerpt for the pdf
> document?

I believe you've stumbled into another bug in the PDF parser. I think the
patch below will fix this for you. Another (likely better) option is to
switch to an external parser for PDFs. I've found that an external PDF
parser, based on pdftotext in the xpdf 0.80 package, is faster than acroread
and produces better results in most cases.

See http://www.htdig.org/FAQ.html#q4.9

If you'd rather stick to acroread, here's the patch for you to try:

--- htdig/PDF.cc.orig Wed Apr 21 21:47:57 1999
+++ htdig/PDF.cc Tue May 25 12:01:43 1999
@@ -290,8 +290,8 @@ void PDF::parseNonTextLine(String &line)
                         _parsedString.get());
 
                 _retriever->got_title(_parsedString);
-                _parsedString = 0;
             }
+            _parsedString = 0;
         }
         
    }
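
To illustrate what the patch changes, here is a minimal stand-alone sketch. It is not the actual PDF.cc code: the SketchParser struct, its members, and the isTitle flag are hypothetical names made up for illustration, and it assumes that whatever is left in the parsed-string buffer ends up at the start of the excerpt. The only thing taken from the patch is the move of the reset from inside the inner block to just after it.

```cpp
#include <string>

// Hypothetical stand-in for the parser state; NOT the real ht://Dig classes.
struct SketchParser {
    std::string parsed;   // plays the role of _parsedString
    std::string title;    // what got_title() would have been given
    std::string excerpt;  // text that would show up in the search excerpt

    // Before the patch: the buffer is reset only inside the inner block.
    void nonTextLineBefore(bool isTitle) {
        if (!parsed.empty()) {
            if (isTitle) {
                title = parsed;
                parsed.clear();
            }
        }
    }

    // After the patch: the reset happens whether or not the inner block ran.
    void nonTextLineAfter(bool isTitle) {
        if (!parsed.empty()) {
            if (isTitle) {
                title = parsed;
            }
            parsed.clear();
        }
    }

    // Assumption: body text is appended to whatever is still in the buffer.
    void textLine(const std::string &text) {
        parsed += text;
        excerpt = parsed;
    }
};

// Excerpt produced when the leftover string is not taken as the title (old logic).
std::string excerptBefore() {
    SketchParser p;
    p.parsed = "htdig9751.pdf ";
    p.nonTextLineBefore(false);
    p.textLine("Body text");
    return p.excerpt;
}

// Same input with the patched logic.
std::string excerptAfter() {
    SketchParser p;
    p.parsed = "htdig9751.pdf ";
    p.nonTextLineAfter(false);
    p.textLine("Body text");
    return p.excerpt;
}
```

In this sketch the old logic leaves the temp file name in the buffer whenever the inner block is skipped, so it is prepended to the body text, while the patched logic always discards it.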

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


