Gilles Detillieux (email@example.com)
Thu, 26 Aug 1999 10:57:22 -0500 (CDT)
According to Ken Whichard:
> We have been using htdig for about a month and are still learning. I have a situation that I don't understand - any help would be appreciated.
> In the .conf file, I have set the following:
> excerpt_show_top: yes
> excerpt_length: 300
> We have about 300 documents (pdf format) that have a note at the bottom that looks like - "*Every major requires..." (note the asterisk)
> When I perform a search for one of the documents by title, that note line shows up as the top line of the header that is printed in the long format.
> I don't want that to happen.
> Any ideas?
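The problem with PDFs is that the order in which the text appears in
the file is not necessarily the order in which you'll read it on a
finished page. When you index PDFs using acroread, or using
"pdftotext -raw" in an external parser, you get the text in the order
in which it is stored in the file, not the order in which it is laid
out on the page. If the application that made your PDFs wrote that
footnote first, it ends up at the top of the extracted text, and so at
the top of the excerpt.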
All this stems from the fact that applications talk to Adobe's PDF Writer
like they talk to a printer driver, and many of these applications won't
care about the order of things, as long as they appear at the right spot
on the page. It's not at all uncommon for an application to put out
page footers before anything else on the page, or put out text blocks in
an order other than strictly top to bottom. When we created PDFs from
Corel DRAW, it put out all the large caps before any other text in the
text blocks, so when we'd index these, many of the words were missing
their first letter.
One possible solution would be to use parse_doc.pl and pdftotext, but
take off the -raw option. That will make use of pdftotext's coalescing
feature, which will put text in the order it should appear on the page.
This will only work well for single-column, portrait-oriented text,
so if your PDFs fit the bill, that's the way to go. It did the trick for
our Corel DRAW-generated PDFs.
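
To illustrate the ordering problem, and why reordering by position helps
for single-column pages, here is a toy sketch in Python. It is not how
pdftotext actually does its coalescing; the fragment list, the
coordinates, and the function names are all made up for illustration.

```python
# Toy model: each fragment is (x, y, text), with y measured up from the
# bottom of the page. The list order stands in for the order in which
# the application wrote the text into the file (hypothetical data).
fragments = [
    (72, 40, "*Every major requires..."),   # footer, written first
    (72, 700, "Course Catalog"),            # title
    (72, 660, "First paragraph of body."),
    (72, 640, "Second paragraph of body."),
]


def file_order(frags):
    """Text in the order it was written to the file."""
    return [text for _x, _y, text in frags]


def page_order(frags):
    """Text sorted top to bottom, then left to right."""
    ordered = sorted(frags, key=lambda f: (-f[1], f[0]))
    return [text for _x, _y, text in ordered]


if __name__ == "__main__":
    print("file order starts with:", file_order(fragments)[0])
    print("page order starts with:", page_order(fragments)[0])
```

In file order the footnote comes first, which is what ends up at the top
of the excerpt. Sorted by position, the title comes first and the
footnote last. A plain top-to-bottom sort like this would interleave
the lines of a two-column page, which gives a feel for why reordering
is only dependable for single-column text.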
-- 
Gilles R. Detillieux              E-mail: <firstname.lastname@example.org>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930