Adi Levin's Blog for programmers

June 4, 2009

Locking mechanisms

This post discusses four locking mechanisms that are essential for multithreaded programming:

  1. Mutex.
  2. Semaphore.
  3. Critical Section.
  4. Interlocked functions.

Mutex : Provides MUTually EXclusive access to a resource

A Mutex is an object that is used to protect a resource (file / variable / etc…) from being accessed simultaneously by two different threads. Mutexes can also be used to synchronize two different processes, as is demonstrated by the following simple code for a console application: (Download this program)

#include <string.h>
#include <stdio.h>
#include <windows.h>

int main()
{
    HANDLE m = CreateMutex(NULL,FALSE,"adi");
    char c[20] = "multithreading ";
    for(int iter = 0; iter<100; ++iter) {
        WaitForSingleObject(m,INFINITE);
        for(int i=0;i<(int)strlen(c);++i) {
            for(int j=0;j<10000;++j)
                printf("%c",c[i]);
        }
        ReleaseMutex(m);
    }
    CloseHandle(m);
    return 0;
}

This simple program prints the letters of the word “multithreading “, over and over again (100 times). Each character is printed 10000 times. Things get interesting when you run two (or more) instances of the program simultaneously. You’ll notice that at any given moment, only one of the instances is printing, while the other instances are waiting for their turn. This is achieved as follows:

The command CreateMutex creates a named mutex object. In case a mutex with the given name already exists, it simply returns a handle to the existing mutex. This way, different processes obtain handles to the same mutex object.

Don’t confuse the variable “m” with the actual mutex object. “m” is only a handle (a proxy) to the actual object, and in each process that handle may have a different value. A HANDLE is simply a pointer (void*) that Windows API functions use to access the object. Notice the call to CloseHandle(m) at the end of the function. It is important not to leave open handles to objects that you no longer need. CloseHandle(m) doesn’t mean that the mutex object is destroyed. The object is only destroyed when it is no longer needed, i.e. when there are no more open handles to it.

A mutex object is either owned by a thread, or not owned by any thread. Initially, after the call to CreateMutex, the mutex is not owned by the calling thread (because we passed FALSE as the second argument). The call to WaitForSingleObject(m,INFINITE) causes this thread to sleep until “m” becomes signaled, which happens when no other thread owns the mutex object. The first time we reach this point, the mutex is not owned by any thread, so WaitForSingleObject returns immediately. When it returns, the mutex is owned by the calling thread. Now, other threads that try to obtain ownership of the mutex will have to wait until the owning thread calls ReleaseMutex.

Tips on how to use Mutexes

As the example above shows, multiple processes can have handles to the same mutex object, for inter-process synchronization. However, it is more common to see different threads in the same process using mutex objects for synchronization. In this case, the mutex doesn’t need to be named. Different threads can share the same handle. To create an unnamed mutex, call CreateMutex with a NULL name.

To access an existing named mutex, CreateMutex can be used, but OpenMutex is preferable (because it requires fewer privileges).

If a thread owns the mutex whose handle is “m”, then subsequent calls to WaitForSingleObject(m,…) will not block the thread. However, the thread must call ReleaseMutex once for each time that the mutex satisfies a wait.

If a thread that owns a mutex exits without releasing it, the mutex becomes abandoned. It is possible to obtain ownership of an abandoned mutex, but you should treat this as an indication that an error has occurred.

Mutex is good for nested function calls

Suppose there is a set of functions that access a resource, all using the same mutex object for synchronization; that is, each function begins with WaitForSingleObject and ends with ReleaseMutex. What happens if they call one another? What happens in case of a recursive call?

Good news! Everything works well, because function calls are invoked in the same thread as the caller. If function A calls function B, then the WaitForSingleObject in function B will return immediately, since A and B are invoked in the same thread, and the mutex belongs to that thread already.
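In standard C++, the same re-entrant behavior is provided by std::recursive_mutex. Here is a minimal portable sketch of the nested-call scenario (the function names and the shared variable are mine, for illustration only):

```cpp
#include <mutex>

// One re-entrant lock protecting a shared value (illustrative sketch).
static std::recursive_mutex g_lock;
static int g_value = 0;

int inner()
{
    // Second acquisition by the same thread: returns immediately,
    // just like WaitForSingleObject on a mutex the thread already owns.
    std::lock_guard<std::recursive_mutex> guard(g_lock);
    return ++g_value;
}

int outer()
{
    std::lock_guard<std::recursive_mutex> guard(g_lock); // first acquisition
    return inner(); // nested call: would deadlock with a non-recursive lock
}
```

As with ReleaseMutex, the lock must be released once per acquisition; here each lock_guard does that automatically when it goes out of scope.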

Be aware that other methods of synchronization, such as Semaphores and the spinning mutex (see below), are not adequate for this scenario. In the event of nested function calls, they will create a deadlock!

Semaphore: A counter for available resources

In some cases, we want to allow simultaneous access to a certain resource, but we want to limit it to a certain number of threads or processes accessing it at the same time. A semaphore does exactly that. The following code example is a variant of the Mutex example above: (Download this program)

#include <string.h>
#include <stdio.h>
#include <windows.h>

int main()
{
    HANDLE s = CreateSemaphore(NULL,2,2,"adi");
    char c[20] = "multithreading ";
    LONG previous_count;
    for(int iter = 0; iter<100; ++iter) {
        WaitForSingleObject(s,INFINITE);
        for(int i=0;i<(int)strlen(c);++i) {
            for(int j=0;j<10000;++j)
                printf("%c",c[i]);
        }
        ReleaseSemaphore(s,1,&previous_count);
    }
    CloseHandle(s);
    return 0;
}

As in the Mutex example, this program outputs characters to the console window. But what happens when you run multiple (more than two) instances of this program? You’ll notice that at any moment, exactly two of the instances are printing, while the others are waiting for their turn. This is how it’s done:

The call to CreateSemaphore(NULL,2,2,”adi”) creates a named semaphore object with a current count of 2 and a maximum count of 2. Think of the count as the number of available resources. When a semaphore satisfies a wait (i.e. after WaitForSingleObject(s,INFINITE) returns), its count is decreased by 1. When its count is zero, WaitForSingleObject(s,INFINITE) blocks the thread until another thread releases the semaphore, thereby increasing its count.

In this example, the call to ReleaseSemaphore increases the count by 1 (hence the “1” in the second function argument), but it can also increase the count by an arbitrary number, representing a number of owned resources becoming available.
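The count-based semantics can be sketched in portable C++ with a mutex and a condition variable. This is a sketch of the idea, not the Windows object; the class name and methods are mine:

```cpp
#include <condition_variable>
#include <mutex>

// Minimal counting-semaphore sketch: a counter of available resources.
class simple_semaphore {
public:
    explicit simple_semaphore(int initial) : m_count(initial) {}

    // Like WaitForSingleObject: waits until count > 0, then decrements.
    void acquire() {
        std::unique_lock<std::mutex> lk(m_mtx);
        m_cv.wait(lk, [this] { return m_count > 0; });
        --m_count;
    }

    // Like ReleaseSemaphore(s, n, ...): increments the count by n.
    void release(int n = 1) {
        { std::lock_guard<std::mutex> lk(m_mtx); m_count += n; }
        m_cv.notify_all();
    }

    // For inspection in this sketch.
    int count() {
        std::lock_guard<std::mutex> lk(m_mtx);
        return m_count;
    }

private:
    std::mutex m_mtx;
    std::condition_variable m_cv;
    int m_count;
};
```

With an initial count of 2, two acquirers proceed immediately and a third would wait, exactly like the two printing instances above.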

A more interesting example of using semaphores for synchronization is given in my post on threads barrier.

Differences between Mutex and Semaphore

The most obvious difference is that a Mutex is a lock/unlock mechanism, while a semaphore is a counter that allows (but limits) simultaneous access to resources.

Another important difference is that a semaphore is not owned by a thread. It can also operate in single-threaded situations, to let different modules or different sections of the code access a resource (of course, if there is no multithreading at all, there is no need for a semaphore – a simple counter will do the trick).

Critical Section

A critical section is a mechanism for protecting a piece of code from being executed concurrently by different threads. Unlike a Mutex, a critical section works only inside a single process: it cannot be used for inter-process synchronization, only for synchronizing threads within the process. Like a Mutex, at any time it is either free or owned by a single thread. The advantage of a critical section is that it is faster.

Here is sample code, from Wikipedia:

#include <windows.h>

static CRITICAL_SECTION cs;

/* Initialize once, e.g. during program startup */
InitializeCriticalSection(&cs);

void f()
{
    /* Enter the critical section — other threads are locked out */
    EnterCriticalSection(&cs);

    /* Do some thread-safe processing! */

    /* Leave the critical section — other threads can now EnterCriticalSection() */
    LeaveCriticalSection(&cs);
}

/* Release system object when all finished — usually at the end of the cleanup code */
DeleteCriticalSection(&cs);

If no other thread currently owns the critical section, EnterCriticalSection returns immediately. Otherwise, it returns only when the owning thread calls LeaveCriticalSection. There is also TryEnterCriticalSection, which always returns immediately and tells you whether you entered the critical section successfully (in case another thread owns it). Important notice: if a thread terminates before calling LeaveCriticalSection, any other thread reaching EnterCriticalSection will deadlock. This is unlike a Mutex: if a thread ends before releasing a Mutex, the Mutex becomes abandoned, and other threads can still own it.
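A portable analogue of this pattern is std::mutex with an RAII guard; because the guard releases the lock in its destructor, the code cannot accidentally leave the critical section locked on an early return. This is a sketch, and the function and variable names are mine:

```cpp
#include <mutex>

// Sketch: std::mutex playing the role of the critical section.
static std::mutex g_cs;
static int g_shared = 0;

int guarded_update()
{
    std::lock_guard<std::mutex> guard(g_cs); // EnterCriticalSection
    return ++g_shared;                       // thread-safe processing
}                                            // LeaveCriticalSection (automatic, even on early return)
```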

This behavior of critical sections is one reason why you should not use TerminateThread to end a thread: always let your threads end normally, i.e. by returning from the thread function. In particular, malloc and free use a critical section to synchronize heap allocations from different threads, so calling TerminateThread on a thread that uses heap allocation can potentially deadlock your application. Another reason not to use TerminateThread is that it does not free the memory allocated for the thread’s stack (normally 1 MB), so you also get a memory leak.

Interlocked functions

The locking mechanisms discussed above are important and you should learn how to use them. However, you should avoid them wherever they are not necessary, so as not to compromise the scalability of your program (i.e. how much faster it becomes as the number of processors increases). If you care about performance, you should always minimize the number of synchronizations in your code.

For instance, if you have a large array of numbers, it is a bad idea to create a mutex object for each entry, because that means a huge number of mutex objects. On the other hand, if you access the array many times, it can also be a bad idea to lock the entire array with one mutex for every individual access: your threads will spend most of their time waiting, and your processor will be busy context-switching.

An efficient mechanism for lightweight synchronization of small operations is provided by Interlocked Functions. Let’s examine the following example: (download this program)

#include <windows.h>
#include <iostream>

#define N 100000000
 
static DWORD WINAPI unsafe_increment(void *param)
{
   volatile long* x = (volatile long*)param;
   for(int i=0;i<N;++i)
      ++(*x);
   return 0;
}
 
static DWORD WINAPI safe_increment(void *param)
{
   volatile long* x = (volatile long*)param;
   for(int i=0;i<N;++i)
      InterlockedIncrement(x);
   return 0;
}

int main()
{
   volatile long x = 0;

   HANDLE h[2];
   DWORD thread_id;
   h[0] = CreateThread(NULL,1<<24,unsafe_increment,(void*)&x,0,&thread_id);
   h[1] = CreateThread(NULL,1<<24,unsafe_increment,(void*)&x,0,&thread_id);
   WaitForMultipleObjects(2,h,TRUE,INFINITE);
   CloseHandle(h[0]);
   CloseHandle(h[1]);

   std::cout << "Result of unsafe increment: " << x << "\n";

   x = 0;

   h[0] = CreateThread(NULL,1<<24,safe_increment,(void*)&x,0,&thread_id);
   h[1] = CreateThread(NULL,1<<24,safe_increment,(void*)&x,0,&thread_id);
   WaitForMultipleObjects(2,h,TRUE,INFINITE);
   CloseHandle(h[0]);
   CloseHandle(h[1]);

   std::cout << "Result of safe increment: " << x << "\n";
}

This program starts by setting x=0 and then increments it 200,000,000 times, and prints the result, which should be exactly 200,000,000. It does so in two different ways – a safe way and an unsafe way.

The unsafe algorithm is to run unsafe_increment in two threads in parallel. Each time it runs, it performs 100,000,000 increments of the variable x, using operator++. The safe algorithm uses safe_increment, in which we use the function InterlockedIncrement instead of operator++.

The result of the unsafe algorithm is nondeterministic, because operator++ is not an atomic operation. It can be described as three atomic operations:

  1. Load the value of x into a register (mov edx,dword ptr [eax])
  2. Increment the register (inc edx)
  3. Set the value of x by the register (mov dword ptr [eax],edx)

The problem is that one thread can load the same value of x that the other thread has loaded, before the other thread has completed its increment. This means that some increment operations get lost.
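The lost-update effect and its fix can be reproduced in portable C++ with std::atomic, which plays the role of InterlockedIncrement here. This is a sketch, and the function name is mine:

```cpp
#include <atomic>
#include <thread>

// Two threads each perform n atomic increments; no updates are lost,
// so the result is always exactly 2*n.
long atomic_sum(int n)
{
    std::atomic<long> x(0);
    auto work = [&x, n] {
        for (int i = 0; i < n; ++i)
            x.fetch_add(1);  // atomic read-modify-write, like InterlockedIncrement
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return x.load();
}
```

Replacing fetch_add with a plain `++` on a non-atomic long would reproduce the nondeterministic result of the unsafe version.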

The function InterlockedIncrement is designed to solve this specific problem. It locks the memory bus (or, on modern processors, the cache line) for the duration of the operation, making it atomic. There is a whole collection of interlocked functions for performing small operations such as increment, compare-exchange, etc.

Spinning Mutex

Because a Mutex is a kernel object, waiting on it is relatively expensive, and it should be used sparingly when performance is a concern. We can use interlocked functions (which are purely user-mode) to build a light-weight mutex mechanism called a spinning mutex, which can be more efficient in some situations.

Suppose you have a large array and several threads access it in parallel. The chance of two threads hitting the same entry of the array at the same time is small, so it makes little sense to block access to the entire array whenever one thread accesses a single entry. It may be better to have a light-weight locking mechanism that works per entry.

The class spinning_mutex is suited for that job:

spinning_mutex.h:

class spinning_mutex
{
public:
    spinning_mutex() : m_state(1) {}

    void lock();
    void unlock();

private:
    volatile long m_state;
};

spinning_mutex.cpp:

#include "spinning_mutex.h"
#include <windows.h>

void spinning_mutex::lock()
{
    do {
        long prev_state = InterlockedExchange(&m_state,0);
        if (prev_state==1)
            break;
    } while (true);
}

void spinning_mutex::unlock()
{
    InterlockedExchange(&m_state,1);
}

It is light-weight, because it only takes 4 bytes (the member m_state), so you can define a spinning_mutex for each entry of the array. When m_state is 1, the mutex is released, and a lock can be obtained. When it is zero, it means that it has been locked by someone else.

The function lock() changes m_state to 0, but exits only if its previous value was 1. Otherwise, it loops until another thread calls unlock(), which sets m_state back to 1. This differs from an ordinary Mutex in that a blocked thread does not enter a waiting state – it polls until the lock is released, consuming CPU time. This is efficient only if the chance of contention is small.

It is possible to add a Sleep() call inside the do-while loop of lock(), so that a blocked thread sleeps instead of wasting CPU time while waiting. In some cases this results in better performance. The disadvantage is that you might sleep longer than needed, and that Sleep itself requires a kernel call, which is not cheap.

A more scalable implementation of Spinning_Mutex

The above implementation of spinning_mutex::lock() does not scale well, on most common processor architectures, when the number of threads increases. The reason is that the call to InterlockedExchange invalidates the cache line and results in an expensive access to global memory. When several threads spin on a shared spinning_mutex, they repeatedly access the global memory via the bus and cause unnecessary traffic on the bus. Don’t forget that the bus is a finite resource shared by all processors.

Here is a simple modification that makes spinning much more efficient:

void spinning_mutex::lock()
{
    do {
        while (m_state==0) {}  /* spin on the cached value */
        long prev_state = InterlockedExchange(&m_state,0);
        if (prev_state==1)
            break;
    } while (true);
}

When a thread is waiting for the lock to be released, it executes the line

     while (m_state==0) {}

As long as the value of m_state is unchanged, checking it results in a fast access to the processor’s cache and causes no bus traffic. Once another thread releases the lock, the cache line holding m_state is invalidated, and the next check of m_state results in a single access to global memory. Experiments show that this implementation scales much better as the number of threads spinning on the same lock increases.
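The same test-and-test-and-set scheme can be sketched portably with std::atomic standing in for InterlockedExchange and volatile. This is a sketch, not the original code, and the helper function is mine:

```cpp
#include <atomic>
#include <thread>

// Portable sketch of the improved spinning mutex above.
class spinning_mutex
{
public:
    spinning_mutex() : m_state(1) {}

    void lock()
    {
        do {
            // Spin on a plain read first: hits the local cache, no bus traffic.
            while (m_state.load(std::memory_order_relaxed) == 0) {}
            // Then try the expensive atomic swap (the InterlockedExchange analogue).
            if (m_state.exchange(0) == 1)
                break;
        } while (true);
    }

    void unlock() { m_state.store(1); }

private:
    std::atomic<long> m_state; // 1 = free, 0 = locked
};

// Two threads increment a plain counter under the lock; no increments are lost.
long spinlock_sum(int n)
{
    spinning_mutex m;
    long v = 0;
    auto work = [&] {
        for (int i = 0; i < n; ++i) { m.lock(); ++v; m.unlock(); }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return v;
}
```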

Spinning_Mutex is not good for nested function calls!

Please refer to my comment on nested function calls in the section on Mutex above, where I explained that a Mutex object is good for protecting a set of functions that access a resource, even if they call one another or make recursive calls.

Bad news! The spinning mutex is not good in this scenario. If function A has lock() at the beginning and unlock() at the end, and it calls function B, then the lock() in function B will never return – it spins forever.

In the case of nested function calls – use Mutex objects and not spinning_mutex.


Comments

  1. I'm confused – at the beginning I thought you were a Mutex, after reading HANDLE m = CreateMutex(NULL,FALSE,"adi");
     But then I saw that you are a Semaphore, after reading HANDLE s = CreateSemaphore(NULL,2,2,"adi");
     What I meant to say is that this article is clearly written!
     Much better than those I read while I was a student.

    I didn’t understand the Spinning_Mutex mechanism. How is it connected to the array I am working with ?

    Comment by Doron Osovlanski — September 7, 2009 @ 2:44 pm | Reply

    • Thanks Doron.
The way I recommend to use the spinning mutex with the array, in the last example of this post, is the following: if you have an array X of N elements, define another array M of N spinning mutexes. Then, whenever you want to access X[i], call M[i].lock(), and after you finish using it, call M[i].unlock(). This way, the lock is per-element. Different threads accessing different elements of the array will not interfere with one another. Is this better understood now?

      Comment by adilevin — September 7, 2009 @ 2:54 pm | Reply


