[htdig] PDF and PostScript Parsing


Patrick Dugal (patrick.dugal@nrc.ca)
Tue, 23 Feb 1999 10:06:58 -0500


Hello all,

I've discovered that ht://Dig 3.1.0b1 mangles the text of
many PDF documents, although it handles some correctly. The
problem seems to lie in the internal parsing of the acroread
output. As I understand it, the way that output is parsed
hasn't changed in the new version of PDF.cc, so this
probably also applies to the newest version of ht://Dig.

The problem shows up when I search for a word that is
definitely contained in an indexed PDF file. The search
results come back with an excerpt like the following:

[o98-900.pdf]
     ... -
nationsmustbepreparedtosubmitallstructuraldatarequired
tovalidatethediscussiontotheProteinDataBank(Biology
     Department,Bldg.463,P.O.Box5000,BrookhavenNational
Laboratory,Upton,NY11973-5000,U.S.A.).Allrelevantnu-
     cleicacidsequenceinformationmustbedepositedintheGen-
Bankdatabase(GenBankSubmissions,NationalCenterfor ...
     http://mydomain.ca/o98-900.pdf 02/18/99, 46490 bytes

As you can tell from this excerpt, the internal parsing of
the acroread output is not quite what you'd expect. The
words get run together with no spaces between them, so the
indexed text is nearly useless and all but impossible to
search.
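
For what it's worth, here is a rough Python sketch of how one
could flag excerpts like this automatically. It only checks
whether any whitespace-delimited token is unusually long. The
function name and the 25-character threshold are my own
assumptions, not anything that exists in ht://Dig.

```python
def looks_run_together(text, max_token_len=25):
    """Return True if any whitespace-delimited token is longer
    than max_token_len characters (an arbitrary, assumed cutoff)."""
    return any(len(token) > max_token_len for token in text.split())


# One line taken from the excerpt above, and the same kind of
# text with its spaces intact for comparison.
broken = "nationsmustbepreparedtosubmitallstructuraldatarequired"
intact = "nations must be prepared to submit all structural data"

print(looks_run_together(broken))
print(looks_run_together(intact))
```

Long URLs or chemical names would trip a check this crude, so
it is only meant as a quick way to spot affected documents.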

Is there a way I can configure htdig to disable the internal
parsing of the acroread output? I'd like to use the
pdftotext program included in the xpdf software to do the
whole conversion from PDF to text and have htdig receive
this file internally in the indexing process. How would I
go about doing that without changing the source code?
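
To make the idea concrete, this is roughly the conversion step
I have in mind, sketched in Python. It assumes pdftotext is on
the PATH and relies on its documented behaviour of writing to
stdout when the output file is given as "-". The function
names and the UTF-8 decoding are my own assumptions, and the
sketch says nothing about how htdig itself would call such a
script, which is the part I don't know.

```python
import subprocess
import sys


def build_command(pdf_path, pdftotext="pdftotext"):
    # Passing "-" as the text file asks pdftotext to write the
    # extracted text to stdout instead of a file.
    return [pdftotext, pdf_path, "-"]


def pdf_to_text(pdf_path, run=subprocess.run):
    # 'run' is injectable so the function can be exercised
    # without pdftotext installed. check=True raises if the
    # converter exits with an error.
    result = run(build_command(pdf_path), capture_output=True, check=True)
    # The output encoding is an assumption; adjust it to match
    # the local pdftotext configuration.
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage: python pdf2text.py o98-900.pdf
    sys.stdout.write(pdf_to_text(sys.argv[1]))
```

The text produced this way would then need to be handed to
htdig in place of the acroread output.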

Any suggestions would be very helpful.

Pat :)



