[htdig] parse_doc.pl + pdftotext = El Perfecto -.0001:*)

Joe R. Jah (jjah@cloud.ccsf.cc.ca.us)
Sat, 27 Feb 1999 17:16:10 -0800 (PST)

Hi Gilles,

El Perfecto:

Thank you very much for your giant leap for PDF kind;)

I applied your second patch to parse_doc.pl and Derek's fix to
xpdf/TextOutputDev.cc; now all the PDF files in my search path are indexed
using the external parser directive in the config file:

external_parsers: application/msword /usr/local/bin/parse_doc.pl \
                   application/postscript /usr/local/bin/parse_doc.pl \
                   application/pdf /usr/local/bin/parse_doc.pl


One crappy PDF file creates a score of errors during the dig:

  External parser error in line:w^@(Garbage)*

It also appears in the search results as:

  Word Document prereg.pdf

instead of

  PDF Document prereg.pdf
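
I have not checked how parse_doc.pl picks that label, so this is only a guess at the kind of check involved. A minimal sketch in Python (not the script's own Perl) of choosing the label from the file's leading bytes; the function name and the returned strings are hypothetical:

```python
def document_label(head):
    """Guess a display label from the first bytes of a file (hypothetical helper)."""
    # PDF files carry a "%PDF-" header, normally at or near the very start.
    if b"%PDF-" in head[:1024]:
        return "PDF Document"
    # PostScript starts with "%!", sometimes preceded by a Ctrl-D byte.
    if head.lstrip(b"\x04").startswith(b"%!"):
        return "PostScript Document"
    # MS Word files of this era are OLE2 compound files with this signature.
    if head.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "Word Document"
    return "Unknown Document"
```

A file whose header is damaged or missing would fall through such checks, which might be how a PDF ends up with the wrong label.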

The file is:


It can be searched with:


No other word in that file gives a search result; I guess the error
happened near the top of the file, after the line "Pre-Registration Form".
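
One way a parser could guard against this kind of garbage is to drop control characters before emitting word records. The sketch below is in Python rather than the Perl of parse_doc.pl, and it assumes the external parser word record is laid out as "w", word, location (0-1000), heading level, separated by tabs. The function name, the word pattern and the minimum length are my own assumptions:

```python
import re

# Assumed word shape: starts with an alphanumeric, then alphanumerics or a few joiners.
WORD_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9'._-]*")

def word_records(text, min_len=3):
    """Yield tab-separated 'w' records for the clean words found in text.

    Assumed record layout: w <TAB> word <TAB> location (0-1000) <TAB> heading level.
    """
    # Replace control characters (including NUL) with spaces so none can leak into a record.
    cleaned = re.sub(r"[\x00-\x1f\x7f]", " ", text)
    # Trim trailing punctuation, lowercase, and drop words that are too short.
    words = [w.rstrip("'._-").lower() for w in WORD_RE.findall(cleaned)]
    words = [w for w in words if len(w) >= min_len]
    total = len(words)
    for i, word in enumerate(words):
        location = (i * 1000) // total  # relative position within the document
        yield "w\t%s\t%d\t0" % (word, location)
```

Fed the text that pdftotext produces, this would emit only records free of NUL bytes, so a damaged passage costs a few words instead of stopping the parse of the rest of the file.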

P.S. I couldn't correspond during the week because I had a hectic one;
come to think of it, I have one every week;)

Best regards,


     _/ _/_/_/ _/ ____________ __o
     _/ _/ _/ _/ ______________ _-\<,_
 _/ _/ _/_/_/ _/ _/ ......(_)/ (_)
  _/_/ oe _/ _/. _/_/ ah jjah@cloud.ccsf.cc.ca.us


This archive was generated by hypermail 2.0b3 on Thu Mar 04 1999 - 09:09:08 PST