Re: [htdig] Getting URL names to show up in index - success


Subject: Re: [htdig] Getting URL names to show up in index - success
From: naughton@domino.danielwoodhead.com
Date: Mon Jun 12 2000 - 07:42:58 PDT


                                                                                                             
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
To: naughton@domino.danielwoodhead.com
cc: htdig@htdig.org
Subject: Re: [htdig] Getting URL names to show up in index.
Date: 06/05/2000 02:11 PM
                                                                                                             
                                                                                                             

Thanks for your help, Gilles. I'm off and running:

I added conv_doc.pl to /usr/local/bin/ and edited htdig.conf to
look like this:

#external_parsers: application/postscript /usr/local/bin/parse_doc.pl
#extra_word_characters: _
allow_numbers: true
valid_punctuation: .-/!#$%^&'
case_sensitive: false
create_url_list: yes
external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
     application/postscript->text/html /usr/local/bin/conv_doc.pl \
     application/pdf->text/html /usr/local/bin/conv_doc.pl

The PDF files now show up with titles like "PDF Document 123456_latest.pdf",
along with some of the contents of the file for PDFs that have text (as
opposed to scanned pictures). It works great. I'm going to hold off on the
"more adventurous" one for now :) Thanks again.

Dan Naughton

According to naughton@domino.danielwoodhead.com:
>
> There was a parse_doc.pl script that I downloaded with htdig. But the
> directions said that if acroread was in the path, it would find it and
> parse the .pdf's by default. If you wanted an external parser other than
> acroread, you would have to specify it in the htdig.conf. I tried it
> both ways, with similar results. I finally left it on the default
> (acroread).
>

Geoff's suggestion would work, but it could be tedious to manually enter
the file name (or parts of it) into the title field of each PDF, using
Adobe's Acrobat Exchange.

An alternative that doesn't really involve any programming is to install
the xpdf package and the conv_doc.pl script, and change the PDFINFO
definition in conv_doc.pl to "/bin/true". Then, add an external_parsers
definition in your htdig.conf, as shown in conv_doc.pl's comments. In
this way, when it parses PDFs, it won't run the real pdfinfo program,
so it won't grab the real title field from the PDF (if one is defined),
so it will fall back to making up a title like this:

     PDF Document 123456_latest.pdf
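
To illustrate why pointing PDFINFO at /bin/true has this effect, here is
a minimal sketch. It is not the code from conv_doc.pl: the read_title
helper and the "Title:" line format it looks for are assumptions made
for the example. Because /bin/true prints nothing, no title is ever
found and the fallback is used.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not taken from conv_doc.pl): run an info command on
# a file and return the value of the first "Title:" line it prints, or ""
# if it prints no such line.
sub read_title {
    my ($pdfinfo, $file) = @_;
    my $title = "";
    open(my $info, "-|", $pdfinfo, $file) or die "cannot run $pdfinfo: $!";
    while (my $line = <$info>) {
        if ($line =~ /^Title:\s+(.*\S)/) {
            $title = $1;
            last;
        }
    }
    close($info);
    return $title;
}

my $PDFINFO = "/bin/true";    # prints nothing, so no title is found
my $title = read_title($PDFINFO, "123456_latest.pdf");

# Fall back to a made-up title when nothing was found.
$title = "PDF Document 123456_latest.pdf" if $title eq "";
print "$title\n";
```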

If you're feeling a bit more adventurous, and would want the title to
include both the real contents of the PDF's title field plus the file
name, you could instead define PDFINFO to be the real pdfinfo program,
and then find this section of conv_doc.pl:

# print out the title, if it's set, and not just a file name, or make one up
if ($title eq "" || $title =~ /^[A-G]:[^\s]+\.[Pp][Dd][Ff]$/) {
    @parts = split(/\//, $ARGV[2]); # get the file basename
    $title = "$type Document $parts[-1]"; # use it in title
}

and change it to this:

# print out the title, if it's set, and not just a file name, or make one up
@parts = split(/\//, $ARGV[2]); # get the file basename
$title = "$type Document $parts[-1] - $title"; # use it in title

This will give you something like:

     PDF Document 123456_latest.pdf - Title of your document
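
One caveat with the unconditional version: when a PDF has no title field,
the result ends in a dangling " - ". A possible variant, sketched here as
a standalone script, keeps the original test and appends the real title
only when there is one. The make_title helper and the example.com URL are
hypothetical; the sketch assumes $ARGV[2] holds the document's URL or
path, as the "file basename" comment suggests.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of conv_doc.pl): build a title from the
# document type, the document's URL or path, and the title field read
# from the PDF (which may be empty).
sub make_title {
    my ($type, $url, $title) = @_;
    $title = "" unless defined $title;

    my @parts = split(/\//, $url);              # get the file basename
    my $name  = "$type Document $parts[-1]";    # always include the file name

    # An empty title, or one that is just a DOS-style path to a PDF file,
    # counts as "no real title": return the made-up title on its own.
    if ($title eq "" || $title =~ /^[A-G]:[^\s]+\.[Pp][Dd][Ff]$/) {
        return $name;
    }
    return "$name - $title";
}

my $url = "http://www.example.com/docs/123456_latest.pdf";
print make_title("PDF", $url, ""), "\n";
print make_title("PDF", $url, "Title of your document"), "\n";
```

The first call prints "PDF Document 123456_latest.pdf" and the second
prints "PDF Document 123456_latest.pdf - Title of your document".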

Either way, the number in the filename will get parsed as a word in the
title. You'll need to keep your allow_numbers and valid_punctuation
attributes as shown in your earlier e-mail message, so that htdig will
parse and store the number separately.

--
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:
http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930



