Debugging Concurrency
Paquet, the Core Technology Manager for Driv3r and Stuntman creator Reflections, discusses the tricky and increasingly important issue of debugging code on multi-core processors, commenting that, despite the performance advantages: "Unfortunately, with multithreaded applications come a new class of bugs which challenge even the most experienced programmers."
1. Introduction
In July 2000 Apple brought dual-processing to its PowerMac line and symmetric multiprocessing (SMP) became officially mainstream. Less than 4 years later, Intel introduced a feature called "Hyperthreading" to their IA-32 processor line and simultaneous multithreading (SMT) became mainstream too. Before the end of the year, we will see Intel, AMD and Microsoft introducing multi-core processors to the masses. Multithreaded applications are now a fact of life and probably the only way toward maximum performance. Unfortunately, with multithreaded applications come a new class of bugs which challenge even the most experienced programmers.
As a simplification, I will only talk about threads within a single system and I will limit resources to memory. Keep in mind that any of the conditions described below can occur in distributed systems and can involve any part of the hardware, including but not limited to the GPU and the network connections.
2. Race Condition or Race Hazard
A race condition, sometimes called a race hazard, occurs when data can be accessed by two or more threads simultaneously and there is no mechanism to control the temporal ordering of access. Depending on which thread accesses the shared data first, the behavior of the program changes.
In the following example program, the printed value of g_iResult depends on which thread executes its assignment first:
If ThreadOne assigns 50 to g_iResult before ThreadTwo is able to assign 150 to g_iResult, then g_iResult will contain 150 at the time it is printed.
If ThreadTwo assigns 150 to g_iResult before ThreadOne is able to assign 50 to g_iResult, then g_iResult will contain 50 at the time it is printed.
If ThreadOne and ThreadTwo were executed separately, we would have a constant result. The unpredictability is caused by the race between the two threads.
//
// Race.cpp
//
// Example of a race condition
//

#include <windows.h>
#include <stdio.h>
#include <process.h>

int g_iResult = 100;
bool g_bThreadOneFinished = false;
bool g_bThreadTwoFinished = false;

void ThreadOne(void*)
{
    // Wait some random amount of time
    Sleep(rand());
    // Set the result
    g_iResult = 50;
    // Finished
    g_bThreadOneFinished = true;
    _endthread();
}

void ThreadTwo(void*)
{
    // Wait some random amount of time
    Sleep(rand());
    // Set the result
    g_iResult = 150;
    // Finished
    g_bThreadTwoFinished = true;
    _endthread();
}

int main()
{
    // Start the threads
    _beginthread(ThreadOne, 0, NULL);
    _beginthread(ThreadTwo, 0, NULL);

    // Wait for the threads to finish
    while ((false == g_bThreadOneFinished)
        || (false == g_bThreadTwoFinished))
    {
        Sleep(1);
    }

    // Print the result
    printf("Result: %i\n", g_iResult);
    return 0;
}
At a high level, the most common symptom of a race condition is nondeterministic program behavior. At a lower level, it is the unpredictability of values stored in variables shared between multiple threads.
Race conditions are notoriously hard to debug and test for because they often occur only under highly unlikely timing conditions. Symptoms sometimes simply disappear when a debugger is attached or when trace messages are added. This phenomenon is known as a heisenbug (after Heisenberg's uncertainty principle in quantum physics).
As source-level debuggers usually don't help much (they directly affect program timing), your best bet is trace messages. Add trace messages around critical resource usage, log everything, and check whether the order of events matches what you expect.
Architecturally, a commonly used solution is to give one thread exclusive ownership of the would-be shared resource and have every other thread access that resource through inter-thread communication.
3. Deadlock and Livelock
A deadlock is a condition that arises when two or more threads are waiting for circularly locked resources. In multithreaded programs, deadlocks are not unusual as lock mechanisms are a fundamental part of parallel programming.
A livelock is very similar to a deadlock. The difference is that while deadlocked threads are stalled waiting for each other, livelocked threads are never stalled: they keep running, constantly trying to acquire resources and dynamically blocking each other's progress.
Sometimes, a deadlock can occur within a single thread. That condition is called a self deadlock. Although a self deadlock presents the same symptoms as a deadlock, it is not actually a true deadlock.
The following example shows a classic deadlock case. Consider the sequence of actions on the two threads:
• ThreadOne acquires g_hMutexOne
• ThreadTwo acquires g_hMutexTwo
• ThreadOne blocks attempting to acquire g_hMutexTwo