Re: htdig: PDF Support


Avi Rappoport (avirr@lanminds.com)
Wed, 29 Jul 1998 08:31:31 -0700


Colin,

That's very helpful -- may I publish it on my searchtools site?

Avi

At 11:07 AM -0700 7/28/98, Colin Viebrock wrote:
>Also sprach Alex Block (at 12:43 PM 7/28/98 -0400) ...
>>Given that the htdig archive at eosys is down, can someone advise me as to
>>the steps required to provide PDF support within htdig?
>
>First you need to install Adobe Acrobat Reader on your server. Get the
>latest version from:
> http://www.adobe.com
>
>Second, you need to apply the patch that's included in htdig-pdf.tgz,
>available at:
> ftp://sol.ccsf.cc.ca.us/htdig-patches/3.0.8b2/
>
>Don't compile it yet because ...
>
>Third, make the following changes, as pointed out by Sylvain Wallez:
>
><start quote>
>The first one is a bug in PDF.cc (it doesn't seem to affect the PDF
>files on my intranet, but we only use Acrobat to produce them). Here's
>the diff he sent me:
>
>diff -c htdig/PDF.cc.old htdig/PDF.cc
>*** htdig/PDF.cc.old Wed Jul 15 10:46:03 1998
>--- htdig/PDF.cc Tue Jul 14 10:21:38 1998
>***************
>*** 280,286 ****
> }
>
> }
>! else if (line == "BT")
> {
> // Beginning of text block
> if (debug > 3)
>--- 280,286 ----
> }
>
> }
>! else if ( mystrncasecmp( line.get(), "BT", 2 ) == 0 )
> {
> // Beginning of text block
> if (debug > 3)
>
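For readers wondering what the PDF.cc change does: the old test matched only a
line that is exactly "BT", while the new one matches any line whose first two
characters are "BT", ignoring case. Here is a minimal sketch of that
difference. It assumes POSIX strncasecmp behaves like htdig's mystrncasecmp and
uses std::string in place of htdig's String class; the function names are made
up for illustration.

```cpp
#include <cassert>
#include <string>
#include <strings.h>  // POSIX strncasecmp (stand-in for htdig's mystrncasecmp)

// Exact comparison, in the spirit of the original line in PDF.cc:
// true only when the whole line is "BT".
bool is_bt_exact(const std::string& line) {
    return line == "BT";
}

// Prefix comparison, in the spirit of the patched line:
// true when the first two characters are "BT" in any letter case.
bool is_bt_prefix(const std::string& line) {
    return strncasecmp(line.c_str(), "BT", 2) == 0;
}
```

With these, a line such as "BT\r" fails the exact test but passes the prefix
test. The prefix test would also accept a line like "BTfoo", which is the
trade-off of comparing only two characters.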
>
>The second problem is that the default value for the "bad_extensions"
>attribute contains .pdf, which causes all PDF files to be ignored by
>htdig, even if a parser is available.
>
>To correct this, you can either put a "bad_extensions" list without
>".pdf" in your config file (this is what I did), or apply the following
>patch to htcommon/defaults.cc:
>
>diff -c htcommon/defaults.cc.old htcommon/defaults.cc
>*** htcommon/defaults.cc.old Fri Aug 15 01:59:25 1997
>--- htcommon/defaults.cc Mon Jul 13 19:37:33 1998
>***************
>*** 37,43 ****
> {"add_anchors_to_excerpt", "true"},
> {"allow_numbers", "false"},
> {"allow_virtual_hosts", "true"},
>! {"bad_extensions", ".wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif .jpg .jpeg .aiff .pdf .class .map .ram"},
> {"bad_word_list", "${common_dir}/bad_words"},
> {"create_image_list", "false"},
> {"create_url_list", "false"},
>--- 37,43 ----
> {"add_anchors_to_excerpt", "true"},
> {"allow_numbers", "false"},
> {"allow_virtual_hosts", "true"},
>! {"bad_extensions", ".wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif .jpg .jpeg .aiff .class .map .ram"},
> {"bad_word_list", "${common_dir}/bad_words"},
> {"create_image_list", "false"},
> {"create_url_list", "false"},
>
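To illustrate why ".pdf" in bad_extensions hides every PDF from the indexer,
here is a rough sketch of a suffix check against a space-separated extension
list. This is not htdig's actual code; has_bad_extension is a hypothetical
helper, and the real matching rules (case handling, for instance) may differ.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: returns true if `url` ends with any extension in the
// space-separated list `bad_extensions`. The comparison is case-sensitive.
bool has_bad_extension(const std::string& url,
                       const std::string& bad_extensions) {
    std::istringstream list(bad_extensions);
    std::string ext;
    while (list >> ext) {
        // Compare the tail of the URL with the current extension.
        if (url.size() >= ext.size() &&
            url.compare(url.size() - ext.size(), ext.size(), ext) == 0) {
            return true;
        }
    }
    return false;
}
```

Under this sketch, a URL ending in "manual.pdf" is rejected while ".pdf" is in
the list and accepted once it is removed, which is the effect the patch above
is after.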
>Thanks to M.J. Long for bug hunting.
>
><end quote>
>
>Now you can run configure, make clean, make, and make install. Voila, PDF
>parsing!
>
>.........................................................................
>Colin Viebrock Creative Director - Private World Communications
>cmv@privateworld.com http://www.privateworld.com
>ICQ: 11386088
>
> If puns were deli meat,
> this would be the wurst.
>
>----------------------------------------------------------------------
>To unsubscribe from the htdig mailing list, send a message to
>htdig-request@sdsu.edu containing the single word "unsubscribe" in
>the body of the message.

________________________________________________________________
Avi Rappoport, Web Site Search Tools Maven
<mailto:avirr@lanminds.com> <http://www.searchtools.com>



