Thursday, 29 April 2010

Bug hunting or how dumb can you get, really?

As a programmer you sometimes can't help but feel really stupid! Really stupid-stupid! Look at this code snippet for example of a foolish bug that kept me busy for some time. The story behind this is that I added a feature to the product I'm writing for a customer, and was totally happy with it. There was a real customer benefit and it worked like a charm through my tests. However, when deployed to the internal test system, it crashed unexpectedly now and than.

That was a kind of shock, because the product was running stable for more than a year in this environment, and I was totally unsettled as I didn't want to believe that the new feature could have something to do with it. It was tested rock-solid. There must be a mysterious bug somewhere else in the program. And it hid for one year before it disclosed its identity! Bad!

However, after I ran the collected test data on my development machine, it turned out to be a pretty uncomplicated bug:
    for(size_t i = 0; i < a_prInput->frameCount(); i++)
        // header
        size_t start = a_prInput->NettoDataParts[i].uOffset;
        size_t pos = start;
        DnsHeaderData header = parseHeader(a_prInput, pos);
        // interesting data there?
            TRACE("dns: packet isn't a response, ignoring...");
        // skip the questions
        for(int i = 0; i < header.questions; i++)
            if(!parseDnsName(a_prInput, start, pos))
                TRACE_ERR("dns: cannot parse a question record");
                m_uGoodByteCnt = pos - a_prInput->NettoDataParts[i].uOffset;

You see? The code was referencing the data form the outer loop with the index of the inner one! And it did it only in error case, and crashed only if it was out of luck. Arrgh... There was no problem with fixing this, but it started me thinking: I'm programming for some 20 years, I know C++ in and out, why did I make this bug in the first place? Did I violate some programming rule or best practice? Is it possible to avoid such bugs by some sound practices? Some answers come to mind:
  1. There should be a warning from compiler
    That would be the ideal solution. But what warning should that be? And what if I really wanted this that way? I'd need to use an ugly #pragma then? No, better let it be.

  2. I should have used at()What should I say, er, well, at() is for wimps! And if that wouldn't be the case, we have a choice between at() and the [] operator. So we want to excercise our rights!

  3. Use telling names
    Well, of course, I could name each loop variable accordingly, but come on, can you imagine the hassle? Because if this has to be of any good you have to be consequent. So maybe only do this for nested loops? Maybe, but if I start to do this as to avoid this error, and something else to avoid some other error, you guess it, it becomes a pain! We need something more general,
  4. Use for_each(), as we don't do any addressing here
    That's more interesting, as logically speaking, the inner loop isn't indexing single elements, but rather wants only to skip over the uninteresting headers. But, on the other side, for_each() is annoying, as it requires start and end of the sequence, which isn't a problem if I'd use the iterator, but gets pointless if I never want to use the iterator directly!

  5. use iterators to go high level
    Isn't that the same as the above? No, because it's not an application of a specific technique in a specific situation, but rather a conscious decition for rising of the abstraction level. And quite serendipitously we can avoid the above mentioned error! Or are we? What if both outer and inner loop are using the same name for the iterators? Here we are agan back on the square one and have to use for_each() for the inner loop as to be safe.
Alas, none of them is satisfactory. Well, I know, there's one more solution: code reviews (or pair programming). But first, my customer isn't doing them that often, and second, I doubt this bug would be found in a review/pair. But even assuming it would be found: the price is much to hight! It took me a couple of minutes to fix it, once I had the right data, and my customer expects "results just now".
What is the key point? What does this all mean? Are we doomed? I think that there are two possiblities here: either use iterators or some other high level constructs (which admittedly is somehow of an overkill), or be prepared for occasionally fits of stupidity. In the end we are only humans, and errare humanum est... And who wants to be safeguarded against everything?

PS: Sorry, I started this entry long ago, but in the meantime I was so distracted that I couldn't come round to write anything for the blog. Besides, microblogging isn't doing it any easier ;-).
PS2: Other possibility: write shorter functions, use a skipNameRecords() method and you are saved! But really, that function is under 1 page lenght, How long should functions be? Five lines or so? Or seven as adjust to humain brain's cognitive limitations?