[htdig] external parser causes htdig core dump (was Re: Beware of 3.1.0 for Sun-sparc)


Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Thu, 11 Feb 1999 14:20:26 -0600 (CST)


According to Frank Richter:
> Got another segmentation fault (after digging 40,000 docs) - this occurred
> while parsing a Word doc
> http://www.tu-chemnitz.de/wirtschaft/bwl2/download/portrait.doc
> via external parser parse_word_doc.pl.
>
> I've no idea if this portrait.doc is ok, but our robust digger shouldn't
> die by M$ docs... (I knew it, parsing word docs must be dangerous... :-)
>
> (BTW, contrib/htparsedoc/parse_word_doc.pl has errors - wrong line breaks)

Yes, this was a script developed by Jesse op den Brouw which got munged
by a mail program before making its way into contrib/htparsedoc.
The correct script can be taken from:

        http://www.st.hhs.nl/htdig/parse_word_doc.pl

However, Jesse and I have been in contact about a problem very much like
the one you report. In Jesse's case, the problem was a WordPerfect
document that had a .doc suffix, so the http server tagged it as
application/msword. The catdoc program barfed out all sorts of garbage
(not unlike the stuff passed to got_word in the backtrace you sent).
I don't know why this leads to a core dump, though. I can't reproduce
the problem on my Red Hat Linux 4.2/Intel box here!

I've developed a more versatile version of Jesse's perl script, which
you can get here:

        http://www.scrc.umanitoba.ca/htdig/rpms/parse_doc.pl

It checks for WordPerfect documents, and doesn't run catdoc on them.
It also checks for PostScript files, and runs ps2text (from the
Ghostscript package) on them. This requires a "file" command that
can distinguish between WP and Word files. Your portrait.doc file
is actually an RTF format file, which catdoc also barfs on, so you'd
need to adapt the script to eliminate those too. Trouble is ".doc"
is a fairly ambiguous suffix, used for documents of all sorts, and not
necessarily exclusively for MS Word documents.
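
As a rough illustration of the kind of check described above, here is a
Python sketch that classifies a document by its leading bytes instead of
trusting the ".doc" suffix. This is not the code in parse_doc.pl (which is
Perl and relies on the "file" command). The function names sniff_format and
should_run_catdoc are made up for this example, and treating only OLE2
files as safe for catdoc is an assumption: older Word formats that do not
use the OLE2 container would be skipped by this test.

```python
# Hypothetical helper, not part of ht://Dig or parse_doc.pl.
# Classify a document by well-known leading "magic" bytes.

OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 container (Word 6/95/97)
RTF_MAGIC = b"{\\rtf"                              # Rich Text Format
WP_MAGIC = b"\xffWPC"                              # WordPerfect
PS_MAGIC = b"%!"                                   # PostScript


def sniff_format(head: bytes) -> str:
    """Return a coarse format name based on the first bytes of a file."""
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(RTF_MAGIC):
        return "rtf"
    if head.startswith(WP_MAGIC):
        return "wordperfect"
    if head.startswith(PS_MAGIC):
        return "postscript"
    return "unknown"


def should_run_catdoc(head: bytes) -> bool:
    """Assumption: only hand OLE2 files to catdoc; skip everything else."""
    return sniff_format(head) == "ole2"
```

A caller would read the first few bytes of the fetched file (for example
with open(path, "rb").read(8)) and skip catdoc whenever should_run_catdoc
returns False, so an RTF file such as portrait.doc would never reach it.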

That's still just working around the problem, though! What I'd like to
know is why, on some systems, got_word causes a core dump when given
a garbage word. The String functions shouldn't care what characters
appear in these strings. Also, got_word doesn't do a whole lot itself,
but mostly passes the word on to a few other functions - I find it odd
that the core dump doesn't happen at a deeper level of nesting, if it
has a problem with the characters.

Could you set up a configuration file that digs only this document, e.g.:

start_url: http://www.tu-chemnitz.de/wirtschaft/bwl2/download/portrait.doc

and then run htdig with -vvvvvv, using this configuration and your
current parse_word_doc.pl script? I'd like more info about what's
happening just prior to the core dump.
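
For anyone who wants to script that test, here is a minimal Python sketch
that builds such a one-document configuration and the matching htdig
command line. The helper names build_config and build_command, and any
paths passed to them, are placeholders, not anything shipped with htdig.
It assumes the parser is hooked in through the external_parsers attribute
for application/msword and that the configuration file is selected with
htdig's -c option.

```python
# Hypothetical helpers for a single-document test dig; paths are placeholders.

DOC_URL = "http://www.tu-chemnitz.de/wirtschaft/bwl2/download/portrait.doc"


def build_config(database_dir: str, parser_path: str) -> str:
    """Return the text of a minimal config that digs only DOC_URL."""
    lines = [
        f"database_dir: {database_dir}",
        f"start_url: {DOC_URL}",
        # Assumption: the Word parser is registered for this MIME type.
        f"external_parsers: application/msword {parser_path}",
    ]
    return "\n".join(lines) + "\n"


def build_command(config_path: str) -> list:
    """Return the argument list for a very verbose htdig run."""
    return ["htdig", "-c", str(config_path), "-vvvvvv"]


if __name__ == "__main__":
    # Print what would be written and run; nothing is executed here.
    print(build_config("/tmp/htdig-test", "/path/to/parse_word_doc.pl"))
    print(" ".join(build_command("/tmp/htdig-test/test.conf")))
```

Writing the returned text to the configuration file and running the
command with its output redirected to a log file should capture whatever
htdig prints before the crash.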

> (gdb) bt
> #0  0x1b550 in Retriever::got_word (this=0xeffff6d8,
>     word=0x10b8c9a "$J.\231\2049>\213:\031N\0162\2264\005\204vv\006\03182hkw",
>     location=0, heading=272) at Retriever.cc:876
> #1  0x1ee10 in ExternalParser::parse (this=0x435100,
>     retriever=@0xeffff6d8, base=@0xca8d68) at ExternalParser.cc:168
> #2  0x1a6e0 in Retriever::RetrievedDocument (this=0xeffff6d8,
>     doc=@0x1eaaf0, ref=0x83de50) at Retriever.cc:556
> #3  0x1a2ac in Retriever::parse_url (this=0xeffff6d8, urlRef=@0x44b788)
>     at Retriever.cc:458
> #4  0x19cf0 in Retriever::Start (this=0xeffff6d8) at Retriever.cc:288
> #5  0x1e188 in main (ac=9, av=0xeffff8ec) at main.cc:245

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930