Thu, 18 Dec 1997 13:50:17 +0100 (MET)
Tim White writes:
> Also I'm looking for an elegant method of indexing pdf files. I've converted
> my pdf files into postscript with xpdf and then index the postscript files.
> I'd like to have the search return a pdf file instead of a postscript file.
> I have this working in a kludgy kind of way by renaming files, index, rename
> files. I thought of using xpdf source to add a .pdf parser but this would
> be involved I think.
I am indexing text extracted from postscript that is normally presented
to the user with a CGI interface. The solution for this was to create a
CGI script that does three things:
1. when called without a document name argument it presents a list of
links to all documents. These links point back to the CGI itself.
2. when called with a document name it examines the HTTP_USER_AGENT to
tell if it was called from the search engine (htdig), or from a real
browser. Browser calls are redirected to the normal viewing software at
a different URL.
3. when the search engine calls the CGI to get a document, a stripped
down version of the document is generated for indexing. This version
consists only of a document title and the plain text contained in the
document.
The parameters to the CGI are contained in the path so htdig will not
recognize it as a CGI.
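
The three cases above can be sketched roughly as follows. This is only an
illustration in Python, not the script actually used here: the DOCUMENTS
table, the VIEWER_URL prefix, and the test for the substring "htdig" in the
user agent are all assumptions made for the example.

```python
import os
import sys
from html import escape
from urllib.parse import quote

# Hypothetical data: document name -> (title, plain text extracted beforehand).
DOCUMENTS = {
    "report1": ("First Report", "Plain text extracted from the postscript."),
}

# Assumed URL prefix of the normal viewing software.
VIEWER_URL = "http://www.example.org/viewer/"


def handle_request(environ, documents=DOCUMENTS):
    """Return (status, headers, body) for one request."""
    # The document name travels in the path (PATH_INFO), not in a query
    # string, so the URL does not look like a CGI call to the indexer.
    name = environ.get("PATH_INFO", "").strip("/")
    agent = environ.get("HTTP_USER_AGENT", "")

    if not name:
        # Case 1: no document name, so list links back to this script.
        script = environ.get("SCRIPT_NAME", "")
        links = "\n".join(
            '<a href="%s/%s">%s</a><br>' % (script, quote(n), escape(t))
            for n, (t, _text) in sorted(documents.items())
        )
        return "200 OK", {"Content-Type": "text/html"}, links

    if name not in documents:
        return "404 Not Found", {"Content-Type": "text/plain"}, "unknown document\n"

    if "htdig" not in agent.lower():
        # Case 2: a real browser, so redirect to the viewing software.
        return "302 Found", {"Location": VIEWER_URL + quote(name)}, ""

    # Case 3: the indexer, so emit only a title and the plain text.
    title, text = documents[name]
    body = "<html><head><title>%s</title></head><body><pre>%s</pre></body></html>" % (
        escape(title),
        escape(text),
    )
    return "200 OK", {"Content-Type": "text/html"}, body


def main():
    """Write the response in CGI form: header lines, blank line, body."""
    status, headers, body = handle_request(os.environ)
    sys.stdout.write("Status: %s\r\n" % status)
    for key, value in headers.items():
        sys.stdout.write("%s: %s\r\n" % (key, value))
    sys.stdout.write("\r\n" + body)


if __name__ == "__main__":
    main()
```

Keeping the decision logic in handle_request, separate from the output
code, makes it easy to check each of the three cases without a web server.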
This approach should work for your problem, if you know how to extract
the plain text from your PDF files. That may itself be a nontrivial task:
in theory Ghostscript should do it, but (1) it may not be too robust and
(2) the text contained in PDF documents may not be good enough for
useful indexing, especially if it contains non-ASCII characters and/or
was converted from TeX documents.
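
As a rough sketch of the extraction step, the following Python wraps the
pdftotext command-line tool (distributed with xpdf) instead of Ghostscript.
That substitution, the UTF-8 decoding, and the non_ascii_ratio helper are
assumptions for illustration; the ratio is only a crude hint about whether
the extracted text is usable for indexing.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path):
    """Build the command line; "-" asks pdftotext to write to stdout."""
    return ["pdftotext", pdf_path, "-"]


def extract_text(pdf_path):
    """Return the plain text of a PDF, or None if extraction is not possible."""
    if shutil.which("pdftotext") is None:
        return None  # tool not installed
    result = subprocess.run(pdftotext_command(pdf_path), capture_output=True)
    if result.returncode != 0:
        return None  # extraction failed, e.g. a damaged file
    # Assumed output encoding; undecodable bytes are replaced, not fatal.
    return result.stdout.decode("utf-8", errors="replace")


def non_ascii_ratio(text):
    """Fraction of characters outside ASCII (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(1 for ch in text if ord(ch) > 127) / len(text)
```

A document whose ratio is unexpectedly high may deserve a manual look
before it goes into the index.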
PS: you can look at our documents via
or at the indexer interface at
-- Guenter Radestock, Universitaetsbibliothek Karlsruhe
firstname.lastname@example.org http://www.ubka.uni-karlsruhe.de/~guenter
This archive was generated by hypermail 2.0b3 on Sat Jan 02 1999 - 16:25:25 PST