Re: [htdig] Using pdftotext to index PDF documents


Geoff Hutchison (ghutchis@wso.williams.edu)
Mon, 1 Mar 1999 16:41:56 -0500


[Let's move this discussion to htdig3-dev, since it's a bit off-topic for
htdig]

>ungets them. If it gets a lowercase letter followed by a tab, it goes
>on to function using the currently defined protocol. If it gets "Co",
>it gets the first line to look for a Content-type header, and passes
>the rest of the input to a new

But this requires the parser to recognize the content type! I'd rather put
this intelligence in the htdig code. Then a "parser" or "decoder" for
compressed files could be something like this:

external_decoder: application/gzip zcat \
                application/bzip2 bzcat

This seems much easier for both the user and the parser-writer. A file of
magic headers is easy to come by, and anyone running Apache already has one.
If we don't want to add another attribute and interface, then we should at
least drop the Content-Type requirement.
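
To make the idea concrete, here is a rough Python sketch of detection by
magic header plus a decoder table. It is only an illustration, not htdig code:
the external_decoder attribute is just the proposal above, and the names
MAGIC, EXTERNAL_DECODERS, sniff_mime, decoder_for and decode are made up for
this example. It assumes zcat and bzcat read the document on stdin and write
the decoded bytes to stdout.

```python
import subprocess

# Magic-number prefixes for the two types in the example attribute above.
MAGIC = [
    (b"\x1f\x8b", "application/gzip"),   # gzip header bytes
    (b"BZh", "application/bzip2"),       # bzip2 stream header
]

# Hypothetical in-memory form of the proposed external_decoder attribute:
# MIME type -> command to run.
EXTERNAL_DECODERS = {
    "application/gzip": ["zcat"],
    "application/bzip2": ["bzcat"],
}


def sniff_mime(data):
    """Return the MIME type whose magic prefix matches data, or None."""
    for prefix, mime in MAGIC:
        if data.startswith(prefix):
            return mime
    return None


def decoder_for(data):
    """Return the decoder command for data, or None if no decoder applies."""
    return EXTERNAL_DECODERS.get(sniff_mime(data))


def decode(data):
    """Pipe data through its decoder; return it unchanged if none matches."""
    command = decoder_for(data)
    if command is None:
        return data
    # The decoder gets the raw document on stdin and writes to stdout.
    result = subprocess.run(command, input=data,
                            stdout=subprocess.PIPE, check=True)
    return result.stdout
```

The point of the split is that the decoder itself never has to know or report
a content type; the caller decides which command to run.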

While it would be nice to have transparent external parsers as you mention,
it also really complicates them. I'd rather take programs like pdftotext or
ps2ascii and use them directly. Your parse_doc.pl script is nice, but it
still requires *someone* to modify it for new document types. Besides,
doesn't it use magic headers anyway?

>Right now, it seems you can't add external parsers for any arbitrary
>type without adding those types into the code for this function.

Very true. This whole area of code needs to be fixed. There should be one
lookup table of the MIME types that can be parsed. Then, when a document is
encountered, it gets handed to the appropriate parser.
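
A single table could look roughly like the following Python sketch. Again,
this is a guess at the shape, not the existing htdig code: PARSERS, external,
dispatch and the stand-in parser functions are hypothetical names, and the
external entries only record which program would be used rather than running
it.

```python
def parse_html(data):
    """Stand-in for the built-in HTML parser."""
    return ("html", data)


def parse_plaintext(data):
    """Stand-in for the built-in plain-text parser."""
    return ("plaintext", data)


def external(program):
    """Build a handler that records which external program would be used."""
    def handler(data):
        return ("external", program, data)
    return handler


# The one lookup table: every parseable MIME type appears here and
# nowhere else, so adding a type means adding one entry.
PARSERS = {
    "text/html": parse_html,
    "text/plain": parse_plaintext,
    "application/pdf": external("pdftotext"),
    "application/postscript": external("ps2ascii"),
}


def dispatch(content_type, data):
    """Hand data to the parser for content_type; None if unparseable."""
    # Drop parameters such as "; charset=..." and normalise case.
    mime = content_type.split(";")[0].strip().lower()
    handler = PARSERS.get(mime)
    if handler is None:
        return None
    return handler(data)
```

With this layout, built-in and external parsers go through the same lookup,
and an unknown type simply falls through.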

>pointer to the data, and an integer length. Wouldn't it speed things
>up if the HTML and Plaintext classes did likewise?

Probably.

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/



