Re: [htdig] dig problems and PDF parsers


Subject: Re: [htdig] dig problems and PDF parsers
From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Thu Aug 17 2000 - 08:26:37 PDT


According to Stephen L Arnold:
> It's been a while since I built htdig (so maybe I forgot to do something
> important) and I'm having problems with the dig/merge. I get:
>
> DB2 problem...: missing or empty key value specified
>
> I checked the archives for the above, but I didn't find anything helpful.
>
> I have two separate databases, one for html and misc. content, and one
> for M$Word documents. I have different config/search files and database
> dirs, and everything worked fine the last time. Now the second one
> won't build the database (it barfs right away with the above error)
> after the first one builds just fine.

This is a hard one to pin down. You probably found that all such threads
in the archives seem to come to an abrupt end. It's a database corruption
problem, and we never seem to get to the bottom of what causes it.

One cause we've found is that when all documents in your database are
invalidated and deleted, the now-empty database seems to give these errors.
The cause may be different in your case.

Do you get the error when you rebuild the database from scratch (htdig -i)?
If not, that's your solution, at least as long as it doesn't happen again.
If it happens consistently with htdig -i, then please try to pare it down
to a small test case that still fails consistently. If we can reproduce
the problem ourselves, then we'll finally get a step closer to fixing it.
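
A from-scratch rebuild of just the failing database might look like the
sketch below. The function name and the config path in the usage line are
only placeholders; substitute the config file for your Word database. The
-v flags are there only to get more output to look at.

```shell
# Sketch: rebuild one ht://Dig database from scratch.
# rebuild_from_scratch is a hypothetical helper name, not part of htdig.
rebuild_from_scratch() {
    conf="$1"
    # -i makes htdig do an initial dig instead of updating the old database
    htdig -i -v -c "$conf" || return 1
    # htmerge then builds the word index from what htdig collected
    htmerge -v -c "$conf"
}
```

Usage, with a made-up path: rebuild_from_scratch /path/to/word.conf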

> I thought I would get tricky this time, and edit the configure.in file
> (before rebuilding the htdig binaries) to comment out the acroread
> setting (since it always gives an error). I have the conv_doc.pl
> file set to use catdoc, pdftotext, and ps2ascii (just the defaults)
> but I still get errors when digging pdf files.

Don't worry too much about the acroread setting. If you override
the internal PDF.cc parser with an external parser or converter for
application/pdf in external_parsers, then pdf_parser will not even
be used.
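
To see which of the two settings is actually in play, you can pull them out
of the config file. The helper below is only a sketch (show_attr is a
made-up name); it prints an attribute together with any backslash-continued
lines after it. If conv_doc.pl is set up as a converter, I would expect the
external_parsers value to contain an entry along the lines of
application/pdf->text/html followed by the path to conv_doc.pl.

```shell
# Sketch: print one attribute from an htdig config file, including any
# backslash-continued lines that follow it. show_attr is a hypothetical name.
show_attr() {
    awk -v name="$1" '
        # match the line that starts the attribute, or a continuation of it
        cont || index($0, name ":") == 1 {
            print
            # stay in continuation mode only if this line ends in a backslash
            cont = ($0 ~ /\\[ \t]*$/)
        }
    ' "$2"
}
```

Usage, with a made-up path: show_attr external_parsers /path/to/word.conf
and likewise for pdf_parser.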

You don't mention specifically what error messages you get when
indexing PDFs, so it's hard to say what the problem might be. General
recommendations are to check your external_parsers and max_doc_size
settings, and try running conv_doc.pl directly on a few of the problem
files to see what output you get then.
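
Running the converter by hand could be done roughly as follows. This assumes
the usual external parser calling convention of four arguments (input file,
content type, URL, config file); the default script location and the
try_converter name are my own placeholders, so adjust them to your
installation.

```shell
# Sketch: run the converter on one PDF the way htdig would, and show the
# first lines of what it produces. The default path below is an assumption.
try_converter() {
    file="$1"; url="$2"; conf="$3"
    conv="${CONV_DOC:-/usr/local/bin/conv_doc.pl}"
    # arguments: input file, content type, URL, config file
    "$conv" "$file" application/pdf "$url" "$conf" | head -n 20
}
```

Usage, with made-up names: try_converter problem.pdf http://www.example.com/problem.pdf /path/to/word.conf
If that prints nothing, or only an error from pdftotext, the problem is in
the converter setup rather than in htdig itself.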

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930




This archive was generated by hypermail 2b28 : Wed Aug 16 2000 - 22:26:53 PDT