Re: Most Interesting Bug Track Down



Frederick Gotham wrote:
I thought it might be interesting to share experiences of tracking down a
subtle or mysterious bug. I myself haven't much experience with tracking
down bugs, but there's one in particular which comes to mind.

I was writing code which dealt with strings. As per usual with my code, I
made it efficient to the extreme. One thing I did was replace, where
possible, any usages of "strlen" with something like:

struct PtrAndLen {
char *p;
size_t len;
};

This could be initialised with a string as follows:

struct PtrAndLen const pal = { "Hello", sizeof "Hello" };

From that point forward in the code, "pal.len" would be used in place of
strlen.

The code grew though, and at one stage I needed to store info about two
strings in one of these structures. To do this, I used a null separator,
e.g.:

struct PtrAndLen const pal = { "Hello\0Bonjour", sizeof "Hello\0Bonjour" };

All of this, however, was expanded by macros, so I actually had something
like:

MAKE_STR_INFO("Hello\0Bonjour")

The problem with this, however, was that "strlen" and "pal.len" had
different values, because strlen only read as far as the first null
terminator. Anyway, I had to read through the code in detail before the bug
jumped out at me.

First of all (sizeof "inline string") is 1+strlen ("inline string").
So I assume you compensated for this in your macro.
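
To make the two lengths concrete, here is a minimal sketch. The macro
body is an assumption (the original MAKE_STR_INFO was never shown); this
version subtracts 1 so that the stored length excludes the terminating
'\0'. The names "one", "two" and "visible_len" are made up for the
example.

```c
#include <stddef.h>
#include <string.h>

struct PtrAndLen {
    const char *p;
    size_t len;
};

/* Assumed macro body: sizeof counts the terminating '\0', so drop it. */
#define MAKE_STR_INFO(lit) { lit, sizeof(lit) - 1 }

const struct PtrAndLen one = MAKE_STR_INFO("Hello");
const struct PtrAndLen two = MAKE_STR_INFO("Hello\0Bonjour");

/* strlen stops at the first '\0'; the stored length does not. */
size_t visible_len(const struct PtrAndLen *s)
{
    return strlen(s->p);
}
```

For "one" the stored length and visible_len() agree. For "two" the
stored length covers both words plus the separator, while visible_len()
only sees "Hello".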

Second of all, in "The Better String Library", which does the same
thing, this is not a bug but, in fact, the correct behavior. '\0' is a
legitimate character, not a string terminator. Where the semantics
coincide (which is most of the time, when dealing with pure text data)
you can assume strlen(b->data) is the same as b->slen. In
Bstrlib, you would never try to mash two strings together using some
kind of hacked representation such as "string1\0string2"; that would
make no sense. Because Bstrlib is more consistent in this respect,
these sorts of bugs are far less likely.

I'm sure several regulars here have more interesting stories... :)

Oh sure:

1) if (a < 0) a = -a; b = sqrt (a);

2) Anything involving a stack overrun with stack checking turned off.
You just have to be inspired to imagine that this is your problem. The
standard is worthless for helping you here.

3) Assuming that vararg parameters were passed by value and could be
"reset" by retrieving its original value. (No debugger or compiler
diagnostic can help you figure out what is going wrong here.)

4) Watching Microsoft Visual C++ barf on struct tagbstring b = {
sizeof("string")-1, -__LINE__, "string" }; because MS's preprocessor
emitted something like _line+425 for __LINE__, and it complained that
it was not a compile time constant.

5) Adventures with WATCOM C/C++ v11.x's optimizations with "-ol" turned
on. It just fails to build correct code for about 10% of the source
I've written. These are real fun to track down. Like the stack
checking thing, you just have to be inspired to try turning the flag
off to see if it fixes the problem.
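
On item 3: a va_list cannot be portably "rewound" by saving and
reassigning it. A minimal sketch of the C99 way to walk the same
arguments twice, assuming va_copy is available; "sum_twice" is a
hypothetical function for illustration, not the code that had the bug.

```c
#include <stdarg.h>

/* Hypothetical helper: reads the same 'count' int arguments twice. */
int sum_twice(int count, ...)
{
    va_list ap, again;
    int i, total = 0;

    va_start(ap, count);
    va_copy(again, ap);          /* snapshot taken before ap is consumed */

    for (i = 0; i < count; i++)
        total += va_arg(ap, int);

    for (i = 0; i < count; i++)  /* second pass uses the untouched copy */
        total += va_arg(again, int);

    va_end(again);
    va_end(ap);
    return total;
}
```

The copy has to be taken while the original is still at the position
you want to return to, and every va_copy needs its own va_end.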

Then there's the standard "I forgot I made assumption X in function Y
then passed it parameters which technically violated X even though it
wasn't obvious that it was". Unfortunately, in the C language, these
assumptions often take the form of "allocated at least some certain
amount of space" or "the parameter is a well-formed non-empty linked
list" etc, and the error is usually undefined behavior.

I don't do a lot of heap or stack smashing anymore these days, as I
generally wrap things in rigorous enough abstractions, and I just
generally use debug heaps while developing. But there can still be
problems of convention. A hash table I implemented has an iterator
mechanism, and I made the termination condition when the index was
greater than the current hash table size -- the problem is that when I
came back to reuse this code after more than a year, I forgot my
convention for termination and thought it was when the index was < 0.
So I walked off the end of the hash array nicely because I did not
sufficiently document the convention. The problem is that I was using
-1 as the start-up index (since 0 may or may not be a valid entry, and
you *have* to perform an increment on every call to the iterator
incrementor) and so could not use < 0 as the terminator condition. But
it meant that my intuition conflicted with what was necessary. I fixed
this by creating an "isDone" macro for the iterator.
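
A rough sketch of the idea, not the actual hash table code: every name
here (HashTable, HashIter, HASH_ITER_IS_DONE, TABLE_SIZE, and so on) is
hypothetical, the table is a fixed-size array of string pointers, and
"done" is assumed to mean the index has reached the table size.

```c
#include <stddef.h>

#define TABLE_SIZE 8                 /* hypothetical fixed size */

struct HashTable {
    const char *slot[TABLE_SIZE];    /* NULL marks an empty slot */
};

struct HashIter {
    const struct HashTable *ht;
    int idx;                         /* -1 until the first increment */
};

/* The termination convention is written down in exactly one place. */
#define HASH_ITER_IS_DONE(it) ((it)->idx >= TABLE_SIZE)

void hashIterInit(struct HashIter *it, const struct HashTable *ht)
{
    it->ht = ht;
    it->idx = -1;
}

/* Move to the next occupied slot, or to TABLE_SIZE when none is left. */
void hashIterNext(struct HashIter *it)
{
    if (HASH_ITER_IS_DONE(it))
        return;
    do {
        it->idx++;
    } while (it->idx < TABLE_SIZE && it->ht->slot[it->idx] == NULL);
}

/* Callers never compare the index against anything themselves. */
int countEntries(const struct HashTable *ht)
{
    struct HashIter it;
    int n = 0;

    hashIterInit(&it, ht);
    for (hashIterNext(&it); !HASH_ITER_IS_DONE(&it); hashIterNext(&it))
        n++;
    return n;
}
```

Since the loop only ever asks HASH_ITER_IS_DONE, it no longer matters
whether you remember the convention as "< 0" or "past the end".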

With multithreaded errors, I already know a priori that they are
difficult. When I can, and I detect such a bug, I will spend a short
amount of time trying to track it down. If I can't get it, I junk the
contentious code and start over. It's just a matter of productivity --
these bugs can be so hard, that it will take longer to track them down
than to rewrite the code. Sometimes I don't learn/figure out what I
did wrong, but life is too short.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/
