Re: [htdig] Using pdftotext to index PDF documents


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Mon, 1 Mar 1999 15:20:17 -0600 (CST)


According to Patrick Dugal:
> Gilles Detillieux wrote:
> > There's still a bit more work to be done. Patrick mentioned that
> > pdftotext changed hyphens to spaces.
>
> I don't think I ever said that. I mentioned that GhostScript's
> ps2ascii, which accepts PDF as input only when it feels like it,
> translates hyphens (-) into spaces. Xpdf's pdftotext leaves the
> hyphens in, just as they are in the PDF. This may hinder the results
> of a search, but at least it's consistent.

Sorry, my mistake. In any case, hyphens aren't a problem with my
parse_doc.pl script. One of the first changes I made to it, back when
it was parse_word_doc.pl for Word documents only, was to change hyphens
to spaces when generating the word list.

> > (Which raises the question: "why can't an external
> > parser just pass plain text or HTML to htdig for further parsing?")
>
> Very good question. Intuitively, I thought this was the way it should
> work. That way, it would be easier to configure, without having to
> make any programming changes.

Geoff proposes dealing with external parsers and external decoders as
two separate entities, each with its own configuration attribute and
its own interface. I would propose instead extending the external
parser interface as follows:

When reading from the external parser's pipe, htdig would first read
the first two input characters to peek ahead, then unget them. If it
sees a lowercase letter followed by a tab, it carries on with the
currently defined protocol. If it sees "Co", it reads the first line
looking for a Content-type header, and passes the rest of the input to
a new
        Document::RetrieveOpenFILE(FILE *input, char *content_type)

method, which replaces the current document contents with the new
contents. If there's no Content-type header, it passes the whole input
to this new method with a NULL content_type, and the method determines
the type by magic number. It would then do a
doc->getParsable()->parse(retriever, base); to re-parse the new
contents. The only missing piece, apart from RetrieveOpenFILE, is an
inline method in Retriever.h to return a pointer to the current
Document from the private field in the retriever object.
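
To make this concrete, here's a rough sketch of how the dispatch could
work. All the helper names here (parseExternalRecords,
readContentTypeLine, getDocument) are invented for illustration, as is
the type of base, and note that ISO C only guarantees one character of
ungetc() pushback, so a real implementation may need to buffer the
peeked characters itself:

    #include <stdio.h>
    #include <ctype.h>
    #include "Retriever.h"
    #include "Document.h"
    #include "URL.h"

    // Hypothetical helpers, standing in for whatever the real code uses:
    void parseExternalRecords(FILE *input, Retriever &retriever);
    char *readContentTypeLine(FILE *input);

    // Peek at the first two bytes of the external parser's output to
    // decide which protocol it is speaking.
    void dispatchParserOutput(FILE *input, Retriever &retriever, URL &base)
    {
        int c1 = getc(input);
        int c2 = getc(input);
        ungetc(c2, input);  // only one pushback is guaranteed by ISO C;
        ungetc(c1, input);  // a real version may need its own buffering

        if (islower(c1) && c2 == '\t')
        {
            // Old-style parser: tab-separated records, handled as now.
            parseExternalRecords(input, retriever);
        }
        else
        {
            char *content_type = 0;  // NULL => determine by magic number
            if (c1 == 'C' && c2 == 'o')
                content_type = readContentTypeLine(input);
            Document *doc = retriever.getDocument();  // proposed accessor
            doc->RetrieveOpenFILE(input, content_type);
            doc->getParsable()->parse(retriever, base);
        }
    }

Old-style parsers keep working unmodified, since their output always
starts with a lowercase record letter and a tab.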

How does this idea sound to everyone? I know the peeking ahead is a
bit of a kludge, but I didn't want to redefine the current protocol
for external parsers while retrofitting this new approach.
Apart from that, does the rest sound reasonable? I'm not really
familiar with all the object structures yet, but it seems there's only
one Document object, which the Retriever object points to. So, I think
what I suggested above is probably the most unobtrusive way of adding
this functionality.

In looking over all this code, I came up with a couple of other
observations/questions about things that could be changed or improved
in the existing code.

1) Document::readHeader() checks the Content-type header value against
certain built-in values to determine whether the document is parsable
text or not. This prevents fetching files it can't parse, which is a
good thing, but shouldn't it also check ExternalParser::canParse() for
any type it can't parse internally? Right now, it seems you can't add
an external parser for an arbitrary type without adding that type to
the code of this function.
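
For instance, the test could fall back on the external parser list,
along these lines (the built-in type list is a guess, strncasecmp()
stands in for whatever comparison the real code uses, and I'm assuming
canParse() takes the content type as a string):

    #include <strings.h>
    #include "ExternalParser.h"

    // Sketch of the suggested fallback in Document::readHeader():
    // accept any type we can parse internally, plus any type for
    // which an external parser has been configured.
    static int isParsableType(char *contentType)
    {
        return strncasecmp(contentType, "text/html", 9) == 0
            || strncasecmp(contentType, "text/plain", 10) == 0
            || strncasecmp(contentType, "application/pdf", 15) == 0
            || ExternalParser::canParse(contentType);
    }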

2) Document::getParsable() calls Parsable::setContents() to pass the
document contents to the parser it chooses, but this ends up copying the
whole document needlessly into another String. The PDF class avoids
this by redefining its setContents() method to simply store a char *
pointer to the data, and an integer length. Wouldn't it speed things
up if the HTML and Plaintext classes did likewise?
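
Something along these lines, mirroring what the PDF class already does
(the member names here are invented; the real classes would differ):

    #include "Parsable.h"

    // Sketch: keep a pointer and length instead of copying the whole
    // document into another String, as the PDF class already does.
    class Plaintext : public Parsable
    {
    public:
        virtual void setContents(char *data, int length)
        {
            contents = data;         // borrow the Document's buffer...
            contentLength = length;  // ...so nothing gets copied
        }
    private:
        char *contents;              // invented members, for illustration
        int   contentLength;
    };

The only catch is that the Document's buffer has to outlive the parse,
but since parsing happens immediately after retrieval, that should
already hold.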

> > Some users may also want to extract the titles from their PDFs, as
> > Sylvain's code did.
>
> The "title" field located in a pdf is not as meaningful as one would like. As far as
> I know, there is no consistent way to extract the real title of a document. How does
> Adobe expect people to be able to index large numbers of PDF's?

Unless you give PDFwriter a meaningful title whenever you make a PDF,
it won't give you anything terribly useful by default. Yes, this
is a problem, but it may be that there was no way to consistently
extract meaningful titles from all the applications that can interact
with PDFwriter. From the application's perspective, it's like talking
to a printer driver, and I guess the API for that doesn't care about
document titles.

It's a good question to ask of the Adobe people, though, as they have
a whole lot more inside knowledge than any of us do.

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


