Subject: Re: [htdig] Searching a single PDF file
From: Gilles Detillieux (firstname.lastname@example.org)
Date: Fri Oct 13 2000 - 16:21:38 PDT
According to "Yen, Kevin (Contractor)":
> There's a big PDF file that I'd like to search. When converted to HTML, it is
> about a thousand pages. The HTML pages don't look as nice as they ought to, and
> they're not linked to each other as the PDF pages were.
> Is there some way to get ht://Dig to index a single PDF file so that when a
> search term is found, a link is created to the position of the term in the page?
I'm afraid I don't have an easy answer for you. htdig treats each
document as a single entity, so it would probably take some serious
reworking to make it do otherwise.
A kludge that might do at least part of what you want would be
to write an external converter that spat out anchor tags, e.g.
<a name="page102"></a>, at the start of each page (don't ask me how),
and then hack htsearch to use these anchors as links in the excerpt.
It already uses anchors in this way for HTML files, but you'd need to
hack it to change the URL format into something that would work with PDFs.
(Again, don't ask me how, but there's supposed to be a way to specify in
the URL which page of the PDF is displayed first.) Finally,
because htsearch only gives one match per document, you'd only get the
first matching page, so you'd need to use the multi-excerpt patch in
the patch archives, and possibly hack it too to handle your PDF anchors.
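
As a rough illustration of the converter half of this kludge, here is a
minimal Python sketch. It rests on assumptions that are not from the
message above: that the PDF text has already been extracted with page
breaks marked by form-feed characters (as pdftotext normally does), and
that the PDF viewer honours a "#page=N" fragment in the URL. The function
names pages_to_html and pdf_page_url are hypothetical, and wiring the
output into htdig and htsearch is not shown.

```python
import html
import re


def pages_to_html(text, title="PDF document"):
    """Wrap form-feed-separated page text in HTML, one named anchor per page."""
    pages = text.split("\f")
    # A trailing form feed leaves an empty final chunk; drop it so we
    # don't emit an anchor for a page that doesn't exist.
    if pages and pages[-1].strip() == "":
        pages.pop()
    parts = ["<html><head><title>%s</title></head><body>" % html.escape(title)]
    for number, page in enumerate(pages, start=1):
        # Anchor in the style suggested above, e.g. <a name="page102"></a>
        parts.append('<a name="page%d"></a>' % number)
        parts.append("<pre>%s</pre>" % html.escape(page))
    parts.append("</body></html>")
    return "\n".join(parts)


def pdf_page_url(url, anchor):
    """Turn an anchor like "page102" into a PDF link using "#page=102".

    Assumption: the viewer understands the "#page=N" fragment. Anchors
    that don't match the pageN pattern leave the URL unchanged.
    """
    match = re.fullmatch(r"page(\d+)", anchor)
    if match is None:
        return url
    return "%s#page=%s" % (url, match.group(1))
```

The second function only shows the kind of URL rewriting htsearch would
need to do for PDF results; the actual change would have to be made in
htsearch itself.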
-- 
Gilles R. Detillieux              E-mail: <email@example.com>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930