Programming Multi-threaded Architectures: Interlocked Operations

Veteran game programmer Philippe Paquet discusses the advantages and implementation of interlocked operations within a multi-threaded environment, in this exclusive Gamasutra technical feature.

Philippe Paquet, Blogger

June 30, 2006

9 Min Read

1. Introduction

Interlocked operations provide a simple mechanism for synchronizing access to variables that are shared by multiple threads. Interlocked operations don't supplant classic synchronization functions like critical sections or mutex objects but complement them as they answer a very specific need.

2. What are interlocked operations?

Interlocked operations are a high-performance way of updating integer-sized or pointer-sized variables in an atomic manner.

Interlocked operations are high-performance as they use hardware-based synchronization techniques rather than operating system level mechanisms. Interlocked operations are implemented as intrinsic functions and are even translated in some cases to a single processor instruction.

Interlocked operations are atomic. An operation is atomic if it is indivisible. During an atomic write, you can't have another thread reading the value that you are writing half way through the write. Similarly, during an atomic read, you can't have another thread changing the value that you are reading half way through the read. If the operations were not atomic, there would be a possibility that a thread will read part of an old value and part of a new value from a variable that another thread is writing to. Additionally, the atomicity of interlocked operations is guaranteed across CPUs.

The table below shows the interlocked operations available in the Windows API. Keep in mind that other APIs have more or less operations available.

Windows API function	Description
_InterlockedAnd	Perform a Boolean And operation on the specified variable
_InterlockedCompareExchange	Conditionally set a variable on the outcome of a comparison with the current value
_InterlockedDecrement	Decrements the specified variable
_InterlockedExchange	Set a variable and returns its prior value
_InterlockedExchangeAdd	Add to a variable and returns its prior value
_InterlockedIncrement	Increments the specified variable
_InterlockedOr	Perform a Boolean Or operation on the specified variable
_InterlockedXor	Perform a Boolean Exclusive Or operation on the specified variable

Because some of the architectures capable of running Linux don't provide the necessary hardware support, interlocked operations are not available in the user side of the Linux API. They are available only on the kernel side and only on specific architectures.

3. The pros of interlocked operations

Compared to classic lock mechanisms, interlocked operations are the most lightweight and efficient synchronization primitives. The advantages are many.

Interlocked operations are directly supported by the hardware in the form of one or many processor instructions. Compiler support is provided in the form of intrinsic functions. That configuration is very cache friendly. The flow in the instruction cache will not be disturbed by using an interlocked operation while the data cache may take a minimal hit, depending on the platform memory model.

There is no wait state with interlocked operations. When you try to acquire a mutex object or a critical section, your thread may have to wait for them to be released while interlocked operations are performed immediately. That specific property is the base of lockless programming. Lockless programming is a model where synchronization is achieved using non locking primitives. Alternative names are or lock-free, non-blocking and wait-free programming. As lockless programming is notoriously difficult, don't expect to make your application totally lockless. Instead, concentrate your efforts on trying to make lockless some of the algorithms you are using. Good candidates are linked lists [Val95] [Har01] and deque [ST04].

There is no user mode to kernel mode transition as for mutex objects. This is a significant advantage as user-mode to kernel-mode transitions are particularly slow operations on most platforms (about 100 times slower than a normal function call on a Pentium 4 [Ric05]).

It is possible and relatively easy to create complex synchronization primitives using interlocked operations. Below is a rudimentary implementation of a critical section API using _InterlockedCompareExchange and _InterlockedExchange. Keep in mind that the following implementation is only an example. You shouldn't write a replacement for the standard critical section API unless you have a very specific need.

#define CS long
#define CS_FREE 0
#define CS_USED 1

void InitializeCS(CS* pCS)

{
// Set the free flag
*pCS = CS_FREE;
}

void EnterCS(CS* pCS)
{
  while (_InterlockedCompareExchange(pCS, CS_USED, CS_FREE)
  == CS_USED)
  {
    // Spin until we see the free flag then acquire
    // the critical section by setting the used flag
  }
}

void LeaveCS(CS* pCS)
{
// Set the free flag
_InterlockedExchange(pCS, CS_FREE);
}

4. The cons of interlocked operations

While interlocked operations are great when you want performance, they are limited in functionality. Boolean operations or bit manipulation are usually not available. However, if you are ready to slightly compromise the non-locking aspect of interlocked operations, it is usually possible to build the missing operations from already available operations. Below is an example of an _InterlockedAnd based on the very useful _InterlockedCompareExchange. That example can easily be transformed in a variety of Boolean or bit manipulation interlocked operations by replacing the calculation of the desired value.

void _InterlockedAnd( int * piDestination, int iAnd)
{
int iOriginalValue
int iCurrentValue;

  // Get the destination and don't access it
  // anymore except through the interlocked operation
  iCurrentValue= *piDestination;

  do
  {
    iOriginalValue = iCurrentValue;

// Calculate the desired value
int iDesiredValue = iOriginalValue & iAnd;

    // Try to perform an exchange
    iCurrentValue = _InterlockedCompareExchange(piDestination,
    iDesiredValue, iOriginalValue);

    // The value prior to potential exchange is returned
    // If the value has changed, repeat the operation
  }
  while (iOriginalValue != iCurrentValue);
}

Because of their limitation to integer-sized or pointer-sized variables, there are cases where you won't be able to avoid critical sections or mutex objects. When you have to update several members of a data structure, you will have to use in most case a more complex synchronization mechanism.

Hardware support is required for interlocked operations. Most of the modern processors implement the required support but that is not a given, especially in the embedded and mobile market segment. In that segment, adding support for such complex operations would push both the cost and the power consumption beyond what is currently acceptable.

The processor instructions used to implement interlocked operations are highly complex instructions. Not only the processor architecture has a performance impact, bus design and cache structure have significant performance implications too. Depending on the conditions, hundreds to thousands of cycles will be necessary to complete an interlocked operation. Remember, the fastest synchronization primitive is the one that you are not using.

A last limitation of interlock operations lies with inter-process synchronization. As processes have separate memory spaces, inter-process synchronization can only be achieved with mutex objects.

5. Debugging tips

5.1 Don't mix synchronization mechanisms

Mixing interlocked operations with any other synchronization mechanism is a fast track to synchronization bugs. Synchronization mechanisms have not been designed to be mixed. If you are using a critical section to synchronize a data structure, don't use an interlocked operation to access part of your data structure.

5.2 Don't confuse interlocked operations with volatile variables.

Confusing interlocked operations with volatile variables can lead to hard to track bugs. The volatile keyword only ensures that variables are not kept in registers. It doesn't guarantee atomic interactions across threads and the C++ standard even allows compliant compilers to completely ignore the volatile keyword. Additionally, if you access a volatile variable through another variable (a pointer for example) that is not qualified as well as volatile, the program behavior is undefined.

5.3 Beware of memory consistency

Depending of the hardware and the compiler, interlocked operations may or may not guarantee the memory consistency. If no guarantee is given, both the compiler and the hardware can reorder instructions around an interlocked operation and strict program order will have to be explicitly enforced.

If your program writes to memory and then uses what has been written in an interlocked operation, you need to commit outstanding memory accesses before performing the interlocked operation. This prevents both the hardware and the compiler from reordering accesses around the interlocked operation, causing a synchronization bug. This process is called a release memory barrier or release memory fence.

If your program uses an interlocked operation to read a value and subsequently access it thru memory, you need to commit outstanding memory accesses after performing the interlocked operation. Again, this prevents both the hardware and the compiler from reordering accesses around the interlocked operation, causing a synchronization bug. This process is called an acquire memory barrier or acquire memory fence.

The table below shows the memory barrier functions available in the Windows API.

Windows API function	Description
_ReadBarrier	Forces memory reads to complete
_WriteBarrier	Forces memory writes to complete
_ReadWriteBarrier	Block the optimization of reads and writes to global memory

You should note that the x86 implementation of the Windows API provides a full fence / full memory barrier on interlocked operations.

6. References

[Ric05] Jeffrey Richter. Concurrent Affairs - Performance-Conscious Thread Synchronization. In MSDN Magazine, October 2005.

[Val95] John D. Valois. Lock-Free Linked Lists Using Compare-and-Swap. In Proceedings of the 14th ACM Symposium on Principles of Distributed Computing, August 1995.

[Har01] Timothy L. Harris. A Pragmatic Implementation of Non-Blocking Linked Lists. In Proceedings of the 15th International Symposium of Distributed Computing, October 2001.

[ST04] Håkan Sundell, Philippas Tsigas.Lock-Free and Practical Deques using Single-Word Compare-And-Swap. Technical Report no. 2004-02.

About the Author(s)

Philippe Paquet

Blogger

Philippe Paquet is Technical Director for The Collective, Inc. He is an industry veteran who started programming in the early eighties on the Apple II. Before joining The Collective, Inc., Philippe was Core Technology Manager for Reflections Interactive Ltd. and Technical Director for Atari. His most recent titles include Driv3r and Stuntman. He can be reached at [email protected].

See more from Philippe Paquet

Related Topics

Related Topics

Recent in More

Related Topics

Programming Multi-threaded Architectures: Interlocked Operations

About the Author(s)

Latest News

Trending

Cooking Games Spotlight: Deep Dives, Interviews, and More

Featured Blogs

Game Developer Essentials

Related Topics

Related Topics

Recent in More

Related Topics

<span class="ArticleBase-LargeTitle">Programming Multi-threaded Architectures: Interlocked Operations</span>Programming Multi-threaded Architectures: Interlocked Operations

About the Author(s)

Latest News

Trending

Cooking Games Spotlight: Deep Dives, Interviews, and More

Featured Blogs

Game Developer Essentials

Programming Multi-threaded Architectures: Interlocked Operations