Re: [htdig] pdf parser: No error;) Search: No results;(


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Tue, 23 Feb 1999 14:27:00 -0600 (CST)


According to Joe R. Jah:
> On Mon, 22 Feb 1999, Gilles Detillieux wrote:
> > According to Joe R. Jah:
> > > I run ht/Dig 3.1.1 including the parser patch on a BSDI 4.0 box. In my
> > > htdig.config I have:
> > >
> > > pdf_parser: /usr/contrib/bin/pdftops
> > >
> > > rundig does not complain about any pdf files ...
> > > however, search does not find any words in pdf
> > > files; they do not show up in any results.
> > >
> > > Has anyone successfully used pdftops to dig pdf files?
> > >
> > > I appreciate any pointers.
> >
> > The code in htdig/PDF.cc expects the PostScript output from the pdf
> > parser to be in a very specific format -- the one that acroread outputs.
> > The latest version of xpdf is supposed to output PostScript in a
> > compatible format, from what I've read on this list, but I haven't seen
> > any mention of pdftops. My guess, given the lack of results you reported,
> > is that its PostScript output is not compatible. If it doesn't find
> > the tags it expects in the PostScript, it won't give any error messages.
> > It'll just silently ignore what's there as it scans for the beginning
> > of text block marker.
>
> pdftops is part of xpdf package; I just downloaded, compiled, installed
> the latest version of xpdf, and ran rundig. Still no search results;*(

Hi again, Joe. I tried the latest version of pdftops myself, from xpdf 0.80,
and I found out why it doesn't work. It still does not generate any BT or ET
tags, so PDF.cc ignores the PostScript it generates.
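
For anyone who wants to check their own converter's output before pointing
pdf_parser at it, here is a rough sketch in Python. It only counts BT and ET
markers in a PostScript file. It is not the logic in PDF.cc: the function
names are made up for illustration, and it assumes the markers show up as
standalone, whitespace-delimited tokens, which may not match exactly what
PDF.cc looks for.

```python
import re
import sys

# Assumption: BT and ET appear as standalone whitespace-delimited tokens.
# This is an illustrative check, not the matching rule used by htdig/PDF.cc.
_MARKER = re.compile(rb"(?<!\S)(BT|ET)(?!\S)")


def count_text_block_markers(ps_bytes):
    """Return (bt_count, et_count) for raw PostScript data."""
    counts = {b"BT": 0, b"ET": 0}
    for match in _MARKER.finditer(ps_bytes):
        counts[match.group(1)] += 1
    return counts[b"BT"], counts[b"ET"]


def has_text_block_markers(ps_bytes):
    """True if at least one BT and one ET marker are present."""
    bt, et = count_text_block_markers(ps_bytes)
    return bt > 0 and et > 0


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_markers.py output.ps  (script name is hypothetical)
    with open(sys.argv[1], "rb") as handle:
        data = handle.read()
    bt, et = count_text_block_markers(data)
    print("BT markers: %d, ET markers: %d" % (bt, et))
    if not has_text_block_markers(data):
        print("No BT/ET pair found; PDF.cc would likely index nothing.")
```

If this reports zero markers for a file, that would be consistent with the
silent "no results" behaviour described above.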

I searched the archives to find claims that it did work, and I found this
thread, posted by Rick Wiggins:

> At 1:15 PM -0400 1/11/99, Geoff Hutchison wrote:
> >At 11:35 AM -0400 1/11/99, Rick Wiggins wrote:
> >
> >>Perhaps future versions of 'htdig' can generalize the 'pdf_parser'
> >>attribute such that this modification would not be necessary when using
> >>programs other than Acroread. Just a thought...
> >
> >Future versions will do so (see the TODO.html file). However, see below.
> >
> >>comes with a 'pdftops' utility program. To use this program, I had to
> >>modify 'htdig' so that it wouldn't include the '-toPostScript' command
> >>option and would completely specify the output filename, like this:
> >
> >Mm. Last time this came up, when the PDF parser was first included, I was
> >given a pretty definitive answer from Michael J. Long <mjlong@summa4.com>:
> >
> >>I have looked at the output from acroread and from xpdf's version of
> >>pdftops and they differ slightly. Sylvain's PDF module uses acroread
> >>specific tags (BT and ET) to determine where to start searching for
> >>words to index. Unfortunately, pdftops does not insert these tags into
> >>the PostScript output.
> >>
> >>Therefore, the PDF module will not work with pdftops as is. I have some
> >>theories on how to tweak the PDF module to work with both:
> >> - convert the pdf to ps and use the Postscript module to
> >> parse it (looking at the way the modules work, I don't
> >> know if this is possible, I haven't looked at it that much
> >> though)
> >> - convert the pdf to text and parse the text
> >> - improve the parsing capability by stealing code from
> >> the Postscript module
> >
> >Now if the situation has changed, let me know. In the meantime, I'm not
> >going to suggest using xpdf. I'd rather not suggest acroread since it's not
> >open source. But...
>
> Interesting. 'pdftops' seems to be working fine for me. :-/ I'm using
> version 0.80 of 'xpdf' which came out on Nov. 27, 1998. Perhaps this
> problem has been corrected in this version? We'll be indexing a large
> number of PDFs in the near future. I'll report back how it goes using
> 'pdftops'...

So, Rick, can you confirm that pdftops works as a pdf_parser in htdig?
Did you make any changes at all, to the htdig source or the xpdf source,
to get it to work? Which version of htdig were you using to get this to
work? Could I get a sample of PostScript from your pdftops, from a document
that you were able to index? I'd really like to get to the bottom of this.
And hey, you did promise to report back... :-)

Back in the summer, there was also some discussion about using ghostscript
5.10 as a PDF to PS or PDF to text converter. Has anyone had any success
using this as a means of indexing PDF files with htdig? I wouldn't mind
seeing a sample of PostScript generated by gs 5.10's pdf2ps utility, which
I don't have handy right here.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930