Re: [htdig] Getting URL names to show up in index.


From: Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Date: Mon Jun 05 2000 - 12:11:32 PDT


According to naughton@domino.danielwoodhead.com:
> Thanks Gilles,
>
> I clicked on the link in the footer of your email; it looks like my problem
> could be a minor task for some. I guess that's where the "easy matter"
> comment came from. I wouldn't have the vaguest idea how to begin that one
> :)
>
> There was a parse_doc.pl script that I downloaded with htdig. But the
> directions said that if acroread was in the path, it would find it and
> parse the .pdf's by default. If you wanted an external parser other than
> acroread, you would have to specify it in the htdig.conf. I tried it both
> ways, with similar results. I finally left it on the default (acroread).
>
> I had some other feedback like this:
>
> I think what you meant to say is that you want to *search* on parts of a
> filename. (You can already get URL names to show up in search
> results--this is part of the $(URL) variable).
>
> This has been requested a few times, but no one has offered anything in
> terms of implementation. It probably needs something in Retriever.cc after
> it gets through parsing a file to "parse" the URL.
>
> Personally, I'd put the string in your files somewhere (doesn't PDF have a
> "comments" or "keywords" portion?). This will also make it easier for other
> search engines or browsers to get the information.
>
> Dan Naughton

Geoff's suggestion would work, but it could be tedious to manually enter
the file name (or parts of it) into the title field of each PDF, using
Adobe's Acrobat Exchange.

An alternative that doesn't really involve any programming is to install
the xpdf package and the conv_doc.pl script, and change the PDFINFO
definition in conv_doc.pl to "/bin/true". Then add an external_parsers
definition to your htdig.conf, as shown in conv_doc.pl's comments. With
this setup, conv_doc.pl won't run the real pdfinfo program when it parses
PDFs, so it never picks up the PDF's title field (even if one is defined)
and falls back to making up a title like this:

        PDF Document 123456_latest.pdf

If you're feeling a bit more adventurous, and want the title to include
both the real contents of the PDF's title field and the file name, you
could instead define PDFINFO to be the real pdfinfo program, and then
find this section of conv_doc.pl:

# print out the title, if it's set, and not just a file name, or make one up
if ($title eq "" || $title =~ /^[A-G]:[^\s]+\.[Pp][Dd][Ff]$/) {
    @parts = split(/\//, $ARGV[2]); # get the file basename
    $title = "$type Document $parts[-1]"; # use it in title
}

and change it to this:

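# always put the file name in the title, and append the real title if it's set
@parts = split(/\//, $ARGV[2]); # get the file basename
$title = "" if $title =~ /^[A-G]:[^\s]+\.[Pp][Dd][Ff]$/; # ignore a title that's just a file name
$title = "$type Document $parts[-1]" . ($title eq "" ? "" : " - $title"); # use both in title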

This will give you something like:

        PDF Document 123456_latest.pdf - Title of your document
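
If you'd like to try the title logic outside of conv_doc.pl before
editing the script, here is a minimal standalone sketch. The build_title
helper and the example URL are made up for illustration and are not part
of conv_doc.pl; the sketch also assumes that $ARGV[2] in the script holds
the document's URL, and it treats an empty or file-name-only title as
"no title", using the same test as the original section quoted above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of conv_doc.pl): build the title for a
# document of the given type, located at $url, with PDF title $title.
sub build_title {
    my ($type, $url, $title) = @_;
    $title = "" unless defined $title;

    my @parts = split(/\//, $url);            # last element is the file basename
    my $name  = "$type Document $parts[-1]";  # the made-up part of the title

    # Same test as in the quoted conv_doc.pl section: an empty title, or
    # one that is just a DOS-style path to a PDF, counts as "no title".
    if ($title eq "" || $title =~ /^[A-G]:[^\s]+\.[Pp][Dd][Ff]$/) {
        return $name;
    }
    return "$name - $title";                  # file name plus real title
}

# The URL below is an invented example.
my $url = "http://www.example.com/docs/123456_latest.pdf";
print build_title("PDF", $url, ""), "\n";
print build_title("PDF", $url, "Title of your document"), "\n";
```

The first call shows the fallback title you get when no title is
available (as with PDFINFO set to "/bin/true"), and the second shows the
combined form.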

Either way, the number in the filename will get parsed as a word in the
title. You'll need to keep your allow_numbers and valid_punctuation
attributes as shown in your earlier e-mail message, so that htdig will
parse and store the number separately.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



