[htdig] Re: external parser causes htdig core dump

Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Fri, 12 Feb 1999 12:48:42 -0600 (CST)

According to Frank Richter:
> > Could you set up a configuration file that digs only this document, e.g.:
> >
> > start_url: http://www.tu-chemnitz.de/wirtschaft/bwl2/download/portrait.doc
> >
> > and then run htdig with -vvvvvv, using this configuration, and your
> > current parse_word_doc.pl script. I'd like more info about what's
> > happening prior to the core dump.
> I did it, see attached file. You see many many binary data...
> Of course a workaround is to change the external parser to avoid such
> garbage, but htdig should be robust enough...

The log file you sent me unfortunately didn't tell me much, but I did
manage to reproduce the problem. I realised, when I saw how big the
portrait.doc file was, that my htdig was truncating it. I increased
max_doc_size to 2000000, and sure enough, htdig dumped core on your

In looking at your stack backtrace previously, I was so focused on the
garbage words that got_word was getting, that I failed to realise the
problem was the value for heading, which was way out of range, and was
being used, unchecked, as an array subscript.
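
To illustrate the kind of check that would make the subscript safe on
the receiving side, here is a minimal sketch. It is not the actual
Retriever::got_word() code: the table kFactor, its values, and the
helper heading_weight() are hypothetical names chosen for the example,
and the table size of 12 is only assumed from the "hd < 12" bound used
in the patch below.

```cpp
#include <cstddef>

// Hypothetical weight table, one entry per heading class. The size 12
// mirrors the "hd < 12" bound in the patch; the values are made up.
static const std::size_t kHeadingCount = 12;
static const double kFactor[kHeadingCount] = {
    1.0, 5.0, 4.0, 3.0, 2.0, 1.5, 1.2, 10.0, 8.0, 6.0, 1.0, 1.0
};

// Returns true and stores the weight when heading is a valid subscript.
// Returns false, leaving *weight untouched, when it is out of range.
bool heading_weight(int heading, double *weight)
{
    if (heading < 0 || static_cast<std::size_t>(heading) >= kHeadingCount)
        return false;
    *weight = kFactor[heading];
    return true;
}
```

With a guard like this, a garbage heading value taken from binary data
is rejected instead of being used to read past the end of the array.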

The problem you reported seems to be different from the one Jesse had,
which I still can't reproduce, but I hope that this patch, together with
my earlier fixes to ExternalParser.cc, will solve that problem too!

Here's the patch for your problem, Frank. Now, instead of getting a core
dump, you'll get a whole bunch of External parser error messages. For the
sake of defensive programming, Retriever::got_word() should probably still
be fixed to check "heading" before using it as a subscript, but I decided
to put a check in ExternalParser.cc so the error can be reported there.

--- ./htdig/ExternalParser.cc.wordbug Tue Feb 9 18:26:08 1999
+++ ./htdig/ExternalParser.cc Fri Feb 12 12:22:52 1999
@@ -148,6 +148,7 @@
     String line;
     char *token1, *token2, *token3;
+    int loc, hd;
     URL url;
     while (readLine(input, line))
@@ -164,8 +165,10 @@
                   token2 = strtok(0, "\t");
                 if (token2 != NULL)
                   token3 = strtok(0, "\t");
-                if (token1 != NULL && token2 != NULL && token3 != NULL)
-                  retriever.got_word(token1, atoi(token2), atoi(token3));
+                if (token1 != NULL && token2 != NULL && token3 != NULL &&
+                    (loc = atoi(token2)) >= 0 && loc <= 1000 &&
+                    (hd = atoi(token3)) >= 0 && hd < 12)
+                  retriever.got_word(token1, loc, hd);
                   cerr<< "External parser error in line:"<<line<<"\n";
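
The range checks in the patch can also be written as a standalone
helper. The sketch below is one possible variant, not part of the patch:
parse_bounded() and valid_word_fields() are hypothetical names, and it
uses strtol() instead of atoi(), which makes it stricter than the patch
because a field with trailing junk such as "12abc" is rejected rather
than read as 12. The bounds (location 0 to 1000, heading 0 to 11) are
the ones from the patch.

```cpp
#include <cerrno>
#include <cstdlib>

// Parses the whole string as a decimal integer in [lo, hi].
// Returns false for NULL or empty input, trailing junk, overflow,
// or a value outside the range. Assumes lo and hi fit in an int.
bool parse_bounded(const char *text, long lo, long hi, int *out)
{
    if (text == NULL || *text == '\0')
        return false;
    char *end = NULL;
    errno = 0;
    long value = std::strtol(text, &end, 10);
    if (errno == ERANGE || *end != '\0')
        return false;
    if (value < lo || value > hi)
        return false;
    *out = static_cast<int>(value);
    return true;
}

// Bounds taken from the patch: location 0..1000, heading 0..11.
bool valid_word_fields(const char *loc_text, const char *hd_text,
                       int *loc, int *hd)
{
    return parse_bounded(loc_text, 0, 1000, loc) &&
           parse_bounded(hd_text, 0, 11, hd);
}
```

A caller would pass the location and heading tokens to
valid_word_fields() and report a parser error whenever it returns false.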

Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
