Sunday 16 February 2014

Who's to blame? - or the most amazing multithreading bug which wasn't one!



Typically bug-hunting is quite simple, as not to say trivial. But sometimes you encounter some species of bugs that makes you mad for weeks!!! Programming pundits maintain that this are the C++ multithreading bugs, because locks, well..., um..., they just don't compose! So what are we supposed to do? Either we should wait for transactional memory support, or switch to another language, preferably Haskell or Clojure. OK, will do that, but in the meantime there are products to be developed, delivered and sold*. So that's what happened:

1. The Bug

My current project is developed on Windows with Microsoft's Visual Studio as IDE and Intel's XE as a plug-in compiler. We choose Intel XE because it is (unsurprisingly) very good at optimizing code for the Intel platform and, not least, because of its support for the new C++11 standard. Our code is full of boost::thread, boost::mutex, std::atomic's and its ilk, so it falls in the category the gurus are warning of. But - the architecture of the system was recycled from the previous product, so there's no real use in complaining about the non-composability**, and, as we are in the business of streaming files out, the performance is more important than advanced architectural concepts. I just removed a couple of vtable race-conditions from the code and the whole stuff seemed to be running well enough!

But then we changed the Boost version, updated the compiler version, at the same time I made some changes to the threading model to make the internal architecture more dynamic, and booom!, everything stopped working in a huge crash! 


It was "Illegal instruction" and sometimes "Privileged instruction" all the time. What? I'm not issuing assembler instructions in my code! This is definitely something where I can shift the blame on the compiler! On the other side, the C++ standard says that data races constitute the dreaded undefined behavior, in short, that all your bets are off. So why not an illegal instruction? But then there was another type of crash: "Access violation". OK, that's something you could definitely blame on your programming or a data race.

Let me put one thing straight - the crashes came only when the debugger was attached; without debugger the process run without a hitch. But unfortunately this fact doesn't mean very much in the art of debugging multithreading applications - debugger changes program's memory boundaries, ant maybe only accelerates the manifestation of bugs that are otherwise hiding in dark corners of code wanting to become big, ugly Heisenbugs!

First I suspected another vtable race condition and indeed fixed some more instances of it, but that didn't change the overall picture! Next, after inspecting the assembly code following problems were revealed:

First, there were crashes in code like that:
5DF3B12C mov dword ptr [ebp-38h],edi 
5DF3B12F mov dword ptr [ebp-3Ch],esi 
5DF3B132 mov dword ptr [ebp-40h],ebx 
5DF3B135 mov dword ptr [this],ecx 
5DF3B13B mov eax,dword ptr [___security_cookie (5E054DC0h)] 
5DF3B140 xor eax,ebp 
5DF3B142 mov dword ptr [ebp-14h],eax 
5DF3B145 mov dword ptr [ebp-1E0h],eax 
5DF3B14B mov byte ptr [ok],1
where debugger was pointing at mov byte ptr [ok],1 as illegal instruction! What is here illegal! No advance was possible in this case.

But more often I could see something like:
5DF3B152 mov eax,dword ptr [input_size] 
5DF3B155 db 89h 
5DF3B156 test eax,edx 
5DF3B158 db feh 
5DF3B159 db ffh
here the current instruction was, quite understandable, db feh. The problematic part was that before the crash this code looked completely different, i.e. there wasn't a trace of the suspicious db instructions! Well, that way or another, it looked like some dangling pointer problem where memory gets overwritten at some arbitrary address. Or does it? On Windows you cannot overwrite the program section of the process, as there is the code segment protection enabled by default (at least nowadays). Whatever! - I tried to set a data breakpoint ate the changed location, but no change at the given address was reported by the debugger. What the heck, I see that my assembly codes changed, how can debugger be so stupid! So there wasn't any progress on this avenue either.

The second general class of errors was the one causing access violation. It emerged, than an innocuously looking stack pointer manipulation like add esp,0FFFFFFF0h would produce illegal stack pointer value blowing the program off, even though the previous ESP value was well in the allowed range! Looks like sorcery!
5FB4B906  cmp         eax,0BCh
5FB4B90B  jae         5FB4BC25
5FB4B911  add         esp,0FFFFFFF8h
5FB4B914  lea         eax,[conv]
5FB4B91A  mov         dword ptr [esp],2
with register contents like that:
EAX = 322CE9AC EBX = 1BB16DE0 ECX = 359A56A0 EDX = 00000021 ESI = 322CED30 EDI = 322CEBB4 EIP = 5FB4B91A ESP = FFFFFFFC EBP = 322CEB84 EFL = 00010286
Well, with stack pointer corrupted, the result of this was:


Here's another beauty - debugger reported a stack corruption here, as it was completely bewildered with what was going on:



So how was this bug to be handled, and who shall win the blame game?

2. The Explanation.

I tried to inspect the memory directly in the memory window of the Visual Studio debugger, but it was somehow too low lever for my mood at that moment. So because:
  • the problems always occurred in the vicinity of a break point,
  • the debugger was always displaying correct values of the operands in the "Access violation" case,
  • the debugger was never showing the illegal instructions, only after the crash happened,
I decided that the single explanation left was that it's something to do with the tools. Therefore I duly reinstalled everything in my toolchain, Visual Studio, Intel compiler, Boost libraries, in the vague hope that the installation of some of them could be broken. Alas, nothing seemed to help.

At this point, I declared the issue as solved - I had some real work to be done, and if you didn't define excessively many breakpoints, the debugger typically played to the rules. Except when it really mattered of course (sigh).

I explained the problem to my worthy colleague Bernhard Z. so he could take it up, and he investigate it a little bit further. And indeed - he found out some very interesting facts about how exactly the debugger has been guilty!!! 

Contrary to what Visual Studio is telling you, it is possible to attach to a running process with two debuggers (!!!). It is done like that - first attach to it with Studio, and then start WinDebug (don't forget to check the "Nonintrusive" option, or it will complain!). Now, Bernhard did it, and he could have an independant view on the machine code!

An than, with the second view on the assembly code, the problem became obvious - the debugger inserted the debug-break instruction (int3 <=> 0xcc) at the wrong position in the machine code - not before the opcode, but 2 bytes behind it, making the arguments following the opcode to illegal instructions! Look here at the assemble code before setting the breakpoint (the line where the breakpoint has to be set is in bold):
00c7461f 89542404        mov     dword ptr [esp+4],edx
00c74623 e888d00000      call    BoostSerialize!save_data_xml (00c816b0)
00c74628 83c410          add     esp,10h
00c7462b 83c4f0          add     esp,0FFFFFFF0h
00c7462e 8d9560ffffff    lea     edx,[ebp-0A0h]
00c74634 83bd74ffffff10  cmp     dword ptr [ebp-8Ch],10h
00c7463b 8d85c0fdffff    lea     eax,[ebp-240h]
00c74641 0f439560ffffff  cmovae  edx,dword ptr [ebp-0A0h]
00c74648 890424          mov     dword ptr [esp],eax
00c7464b 89542404        mov     dword ptr [esp+4],edx
00c7464f e88cca0000      call    BoostSerialize!save_data_binary (00c810e0)
00c74654 83c410          add     esp,10h
00c74657 83c4f0          add     esp,0FFFFFFF0h
00c7465a 8d9560ffffff    lea     edx,[ebp-0A0h]
00c74660 83bd74ffffff10  cmp     dword ptr [ebp-8Ch],10h
00c74667 8d85c0fdffff    lea     eax,[ebp-240h]
00c7466d 0f439560ffffff  cmovae  edx,dword ptr [ebp-0A0h]
00c74674 890424          mov     dword ptr [esp],eax
00c74677 89542404        mov     dword ptr [esp+4],edx
00c7467b e8d0b90000      call    BoostSerialize!save_data_string (00c80050)
And now the assembler code after the breakpoint has been set:
00c74623 e888d00000      call    BoostSerialize!save_data_xml (00c816b0)
00c74628 83c410          add     esp,10h
00c7462b 83c4f0          add     esp,0FFFFFFF0h
00c7462e 8d9560ffffff    lea     edx,[ebp-0A0h]
00c74634 83bd74ffffff10  cmp     dword ptr [ebp-8Ch],10h
00c7463b 8d85c0fdffff    lea     eax,[ebp-240h]
00c74641 0f43cc          cmovae  ecx,esp
00c74644 60              pushad
00c74645 ff              ???
00c74646 ff              ???
00c74647 ff8904248954    dec     dword ptr [ecx+54892404h]
00c7464d 2404            and     al,4
00c7464f e88cca0000      call    BoostSerialize!save_data_binary (00c810e0)
00c74654 83c410          add     esp,10h
00c74657 83c4f0          add     esp,0FFFFFFF0h
00c7465a 8d9560ffffff    lea     edx,[ebp-0A0h]
00c74660 83bd74ffffff10  cmp     dword ptr [ebp-8Ch],10h
00c74667 8d85c0fdffff    lea     eax,[ebp-240h]
00c7466d 0f439560ffffff  cmovae  edx,dword ptr [ebp-0A0h]
00c74674 890424          mov     dword ptr [esp],eax
00c74677 89542404        mov     dword ptr [esp+4],edx
00c7467b e8d0b90000      call    BoostSerialize!save_data_string (00c80050)
Do you see it now? In Visual Studio's debugger you wouldn't see anything because Visual Studio hides debug-breaks in the disassembly view! That's the typical Microsoft attitude - we know better what you'd like to see - and they are right with it in about 95%. ... until it bites you!

The second type of crash can now be explained too: 0xcc overwrote the correct operand of the add esp instruction, the addition overflowed, and then we got the access violation!

A "trouble ticket" was issued to Intel: http://software.intel.com/en-us/forums/topic/489007. They didn't believe it of course...

3. The moral of the story

Only after that all became clear we could positively be sure that this wasn't a multithreading bug, and were able to continue the development as before ;).

The moral is simple, and already well known - sometimes you can blame it on your tools. 
Plus perhaps another well known one - you always need a second opinion (i.e. WinDebug's)!

And because it seems to be a good time for some bragging, it wasn't the only "you can blame it on your tools" moment in my programming career - a couple of year ago I diagnosed broken multithreaded exceptions/stack unwinding implementation in a port of the gnu compiler to some obscure UNIX derivative (won't tell the names, however, but it was rather a big Telco company). As a result the whole project had to disable exceptions in the compiler. Don't believe it's possible young Jedi? Yes, it is (or it was at that time - I never needed it again).

--
* some discussion about why locking & mutexes are the scapegoats of programming pundits can be found in the "Is Parallel Programming Hard, And, If So, What Can You Do About It?" book here

** for myself I'd opt for a share-nothing, message-passing, "worker threads"-oriented architecture - but well, nobody asked ;). And that post about "Threads as Workers" isn't still written anyway.

No comments: