[htdig] parse_doc.pl + pdftotext = El Perfecto -.0001:*)


Joe R. Jah (jjah@cloud.ccsf.cc.ca.us)
Sat, 27 Feb 1999 17:16:10 -0800 (PST)


Hi Gilles,

El Perfecto:

Thank you very much for your giant leap for PDF kind;)

I applied your second patch to parse_doc.pl and Derek's fix to
xpdf/TextOutputDev.cc; now all the PDF files in my search path are indexed
through the external_parsers directive in the config file:

external_parsers: application/msword /usr/local/bin/parse_doc.pl \
                   application/postscript /usr/local/bin/parse_doc.pl \
                   application/pdf /usr/local/bin/parse_doc.pl

-.0001:

One crappy PDF file creates a score of errors during the dig:

  External parser error in line:w^@(Garbage)*

It also appears in the search results as:

  Word Document prereg.pdf

instead of

  PDF Document prereg.pdf

The file is:

  http://www.ccsf.cc.ca.us/Resources/Title3/training/prereg.pdf
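
A quick way to see what the file really starts with is to look at its
first bytes. The sketch below is only an illustration in Python:
sniff_type is a made-up helper, and I am assuming, not asserting, that
parse_doc.pl chooses its "Word"/"PDF" label from a similar check. The
signatures themselves (%PDF-, %!, and the OLE2 header used by MS Word
files) are the standard ones.

```python
def sniff_type(head: bytes) -> str:
    """Guess a document type from the leading bytes of a file.

    Hypothetical helper for illustration; not taken from parse_doc.pl.
    """
    if head.startswith(b"%PDF-"):
        return "PDF"          # standard PDF header
    if head.startswith(b"%!"):
        return "PostScript"   # standard PostScript header
    if head.startswith(b"\xd0\xcf\x11\xe0"):
        return "Word"         # OLE2 compound file, as used by MS Word
    return "unknown"          # nothing recognised at offset 0


if __name__ == "__main__":
    import sys

    # Usage: python sniff.py prereg.pdf  (the path is whatever local copy you have)
    with open(sys.argv[1], "rb") as handle:
        print(sniff_type(handle.read(8)))
```

If a local copy of prereg.pdf does not report PDF here, that would hint
that the trouble starts with the file header rather than with pdftotext.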

It can be searched with:

  http://www.ccsf.cc.ca.us/cgi-bin/htsearch?config=htdig&restrict=\
  &exclude=&words=pre-registration+form&method=and&format=builtin-short

No other word in that file returns a search result, so I guess the error
happened near the top of the file, right after the line "Pre-Registration
Form".
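
To test that guess, one could run the converter by hand on the file and
look for the first line of its output that carries control bytes, such as
the NUL (^@) shown in the error above. This is a rough Python sketch under
assumptions: first_garbage_line is a hypothetical helper, and its input is
whatever text the converter wrote out, not htdig's own parser stream.

```python
def first_garbage_line(data: bytes):
    """Return (line_number, line) for the first line holding control bytes.

    Returns None when every line is clean. Tab, form feed (often used as a
    page break in extracted text) and carriage return are tolerated.
    """
    allowed = {0x09, 0x0C, 0x0D}
    for number, line in enumerate(data.split(b"\n"), start=1):
        if any(byte < 0x20 and byte not in allowed for byte in line):
            return number, line
    return None


if __name__ == "__main__":
    import sys

    # Usage: python garbage.py prereg.txt  (prereg.txt is an assumed file name
    # for the text extracted by hand from the PDF)
    with open(sys.argv[1], "rb") as handle:
        print(first_garbage_line(handle.read()))
```

A hit on one of the first few lines would fit the idea that everything
after the form title is lost.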

P.S. I couldn't correspond during the week because I had a hectic one;
come to think of it, I have one every week;)

Best regards,

Joe

     _/ _/_/_/ _/ ____________ __o
     _/ _/ _/ _/ ______________ _-\<,_
 _/ _/ _/_/_/ _/ _/ ......(_)/ (_)
  _/_/ oe _/ _/. _/_/ ah jjah@cloud.ccsf.cc.ca.us



