Adi Levin's Blog for programmers

December 13, 2010


Filed under: Multithreading,programming — Adi Levin @ 6:30 pm
Tags: ,

Multithreaded computations are sometimes non-deterministic, meaning that they produce different results in different runs that have the exact same input. In some cases, this is a result of insufficient synchronization between threads, causing Race Conditions (A Race Condition is actually defined as a critical non-deterministic behavior). In other cases the non-deterministic behavior is not critical, but still we would like to avoid it, because debugging  a non-deterministic program is more problematic.

In the following, I discuss two common reasons for non-detemrinism:

Order of floating-point computations

Let’s consider a multithreaded program that performs summation of an array of floating point numbers. Suppose that each thread computes the sum of a part of the array, and at the end the partial sums are summed together. The order at which the numbers are summed is imporant. In floating-point arithmetic, (a+b)+c is not always equal to a + (b+c). Since the number of running threads can be different from one run to another, and the timing of threads is unpredictable, the order in which the summation is done may be non-deterministic, depending on how the program is written.

One way to avoid this problem, if given the choice, is to use integer arithmetic instead of floating-point.

Another way is to divide the task into a constant number of partial sums that do not depend on the number of threads, then store the intermediate result of each task in an array, and finally sum-up that array in one thread. This way, the order at which numbers meet each-other is not random.

Order of insertion to a container

In a fork-join multithreading model, it is often useful to collect the outcome of tasks in a thread-safe container. After the multithreaded fork-join part is completed, the results in the container are used as input to the next part of the program. The problem is that insertion into the container can happen in an unexpected order. This causes the next part of the program to behave in a non-deterministic way.

One way to solve this problem is to add a sort operation after fork-join part. Simply sort the data in the container, and this will guarantee a consistent input to the next stage of the program.

Another way to solve this is to write the code in a way that each task result knows a-priori its place in the container – which is independent of timing and of the number of running threads.

January 23, 2010

Threads Barrier

Filed under: Multithreading — Adi Levin @ 9:23 am
Tags: , , ,

The purpose of a Threads Barrier is to synchronize a collection of threads. Many parallel computing programs are implemented by the SPMD model (Single Program Multiple Data). In this model, a number of threads (or computers) run the same function (or program), but each thread is responsible for handling a different part of the data.  

This is also the model that is used by OpenMP programs. OpenMP invokes a number of threads that run the same piece of code, where each thread processes a different part of the data. 

In such programs we often need to make sure that all threads arrive at a certain point of the code together. We call such a point a Barrier. 

A double-sided Barrier with callbacks 

In the following, I describe a specific implementation of Barrier, that has the following properties: 

  1. The barrier is double-sided. This means that it has an entrance and an exit. We first wait until all threads enter the barrier, and then wait until all threads exit the barrier.
  2. When all threads are inside the barrier, we can run two callback functions – one that is invoked by a single thread only, and one that is invoked by all threads.

The advantage of the two-sided barrier over a one-sided barrier is that 

  1. In a one-sided barrier, that only waits until all threads pass a certain point, it is more difficult to make sure that the barrier object is reusable. To reuse the barrier, we need to make sure that all threads have exited the barrier. In a two-sided barrier we handle that explicitly, so the barrier can be reused freely.
  2. Callbacks are more useful in the two-sided barrier – because there is a period in time when all the threads are guaranteed to be inside the boundaries of the barrier (between the entrance and the exit). Functions that are used as callbacks during that time can safely make assumptions about the position of other threads.

Initialization of the barrier

    volatile long num_of_threads_that_entered_barrier = 0;
    volatile long num_of_threads_that_exited_barrier = 0;
    HANDLE entrance_semaphore = CreateSemaphore(NULL,0,4096,NULL);
    HANDLE exit_semaphore = CreateSemaphore(NULL,0,4096,NULL);

The barrier main function

void threads_barrier(int num_of_threads

    LONG prev; 

    // Wait until all threads enter the barrier 
    if (InterlockedIncrement(&num_of_threads_that_entered_barrier)<num_of_threads)
    else { 
        num_of_threads_that_exited_barrier = 0;


    // Wait until all threads exit the barrier 
    if (InterlockedIncrement(&num_of_threads_that_exited_barrier)<num_of_threads)
    else {
        num_of_threads_that_entered_barrier = 0;


The function threads_barrier starts by waiting for all threads to enter the barrier. We use interlocked functions to increment a counter num_of_threads_that_entered_barrier. If it is below the number of threads, we wait. If we reach the given number of threads, we know that we have (num_of_threads-1) threads that are waiting, so we release them by a call to ReleaseSemaphore. A similar mechanism is used in the exit. For an explanation of interlocked functions and semaphore, please refer to my post on locking mechanisms.

The function single_thread_callback() is invoked in only one thread (the last one to enter the barrier), before the call ReleaseSemaphore.  Thus it can assume that the other threads are not doing anything during that time. This makes it suitable for running initialization code that accesses shared data.

The function all_threads_callback() is invoked in all threads, after all have entered and before any threads has left the barrier. This makes it suitable for running thread-specific initialization code that only accesses thread-local data.

To guarantee reusability of the barrier, we clear the value num_of_threads_that_exited_barrier to zero only after all threads have entered the barrier. This way we know that no thread is currently in the process of exiting the barrier. In the same way, we only clear the value of num_of_threads_that_entered_barrier after all threads have exited the barrier.

September 19, 2009

Ending a thread

Filed under: Multithreading — Adi Levin @ 1:59 pm
Tags: , , ,

There are three ways to end a thread: Return from the thread function, call ExitThread from within the thread, and call TerminateThreadfrom any thread. The best thing to do is return from the thread function. This way, you guarantee that all destructors of automatic variables have been called, and that thread termination is done properly.

TerminateThread – never use it!

BOOL WINAPI TerminateThread(
  __inout  HANDLE hThread,
  __in     DWORD dwExitCode

Some programmers make the mistake of waiting with a timeout until a thread is finished, and then kill it using TerminateThread in case of timeout. Don’t do that! TerminateThread terminates the thread immediately without doing the proper clean-up and notifications. It doesn’t free the stack (1Mb leak), it doesn’t free owned critical sections, it doesn’t call the DLL entry functions to notify them that the thread is being detached. Of course, it doesn’t give the thread a chance to free objects allocated on the heap.

The result is a potential deadlock, and the worst thing is that it happens with low probability, so there is a good chance that software testing will not detect it, and your customer will be the first to encounter it. Why a potential deadlock? Because certain library functions (such as malloc, free and MFC functions), use critical sections to synchronize access of threads within the process to certain resources (such as the heap). This means that whenever you call malloc you are waiting for a critical section to be released. But if you called TerminateThread while that thread owned that critical section, the wait will result in a dead-lock.

In fact, there is no justification to using TerminateThread. It is only safe to use when you control exactly what that thread is doing (e.g. you know that it doesn’t do heap allocation, and it is not inside a critical section). But if you are the programmer of that thread’s function, then why not write code that returns safely from the thread function upon some event or flag, or calls ExitThread when needed?

ExitThread – use it with care!

ExitThread can only be called from within the thread which you want to exit. It frees the stack and calls the entry point of each DLL in the process to notify it that the thread is being detached. However, it does not perform destruction of C++ objects as done when returning from a function. This is why you have to use it with care.

In certain situations, it make sense to use ExitThread for the purpose of simplifying the code. For example, if the thread is in alertable state (i.e. waits for APC functions), other threads can queue an APC function that calls ExitThread, thereby causing the thread to exit.

The best way: Returning from the thread function

The best way to write a thread function is to design it so that another thread (the main thread, or the thread that is managing the worker threads), can cause it to end in an orderly fashion.

 For example, if you run a loop with many iterations, check the value of a certain flag in each iteration, and return if it is non-zero. Other threads can change the value of that flag when they want your thread to finish.


If your thread waits for some object to be signaled, using WaitForSingleObject(obj,…), and you want to be able to stop the wait from another thread, replace the wait function by WaitForMultipleObjects, and wait for obj and for an event that can be signaled by another thread, when termination is needed.

More generally, in any point in the code where you want to be able to interrupt the operation of the thread and cause it to exit, you should place the appropriate code.

September 12, 2009

Windows Thread Pool

Filed under: Multithreading — Adi Levin @ 2:13 pm
Tags: ,

As an alternative to creating threads and explicitly assigning tasks to threads, Windows offers a collection of API functions that manage a pool of threads. The idea is that the operating system will be responsible for the creation and deletion of threads. To execute a certain function on a thread from the thread pool, the programmer doesn’t need to identify the specific thread. The system will automatically manage the queue of functions that should be invoked and the collection (pool) of threads that are used to invoke them.

Significant changes in the Windows API between XP and Vista

In my own experience, the API that is available in Windows XP (now called the legacy thread pool) does not provide enough control over the way that tasks are assigned to threads. In particular, there is only one thread pool, and tasks are processed in a first-in-first-out order. This means that higher priority tasks that arrive later have to wait until low-priority tasks finish their work. Also, the legacy thread pool is confusing because there are two kinds of threads IO and non-IO.

It seems that the guys at Microsoft understood the problems of the legacy thread pool. In Windows Vista and in Windows Server 2008, the thread pool API is significantly improved. In particular, you can construct a number of thread pools, and there is no longer a separation between IO and non-IO threads. Still, if you’re writing code that’s supposed to run on Windows XP, you can’t use the new API.

My personal opinion

I am not a fan of the Windows thread pool. In my own multithreading experience, I started by using the thread pool, because it seemed the easiest thing to do. After a while, I discovered the limitations of the legacy thread pool, and decided I had to switch to a different multithreading API, or to implement my own thread-pool (which I did). Now, even though the new thread pool API seems better, I can’t use it because the programs I write are targeted to both Windows XP and Windows Vista.

I think that some sort of thread pool is a must, but I would not recommend using Windows API for it, because of the above reasons and because other, portable, solutions exist – OpenMP and the Intel Thread Building Blocks. If really can’t stand the idea of using third party libraries, consider implementing your own thread pool mechanism.


An article on the new Thread Pool API –

MSDN article on new and legacy API –

August 30, 2009

Constrained Parallelism

Filed under: Multithreading — Adi Levin @ 9:00 pm
Tags: , ,

This pattern is useful for parallelising complex algorithms, since it enables the programmer to exploit all possible parallelisms in the algorithm.

As input to the pattern, we have N tasks. For each pair of tasks i=0,1,…,N-1 and j=0,1,…,N-1 we are given two pieces of information: Must_Preceed(i,j) and Can_Run_In_Parallel(i,j). When Must_Preceed(i,j)=true, this means that task #j is not allowed to execute before task #i is completed. When Can_Run_In_Parallel(i,j)=true, it means that tasks i and j can run simultaneously in different threads. When it is false, we are not allowed to run them in parallel. The user also defines the number NTHREADS that says how many tasks are allowed to run in parallel at any given time. For scalability, one good policy is to set this number to the number of processors available on the machine.

When the user starts the algorithm, we first run tasks that have no required predecessors, and that are allowed to run in parallel to each-other. Whenever a task ends, we find another task that hasn’t been invoked yet, and that can be invoked given the constraints.

I recommend implementing a generic class that performs this pattern. It forces the programmer to think systematically about his algorithm and analyze its parallel structure.

Implementation using a master thread

The implementation requires NTHREADS worker threads and one extra thread whose job is to allocate tasks to worker threads and to be notified when tasks are finished. At the beginning, we calculate and store, for each task #i, the number NPRED(i) of required predecessors. If NPRED(i)=0, the task is allowed to run, as long as there are no running tasks j for which Can_Run_In_Parallel(i,j)=false.

In the master thread there should be a message loop (see my post on messages and message queues). Whenever a task finishes its work in a worker thread, it sends a message to the master thread using PostThreadMessage. The message says which task has just ended – for example task #i. The master thread recieves the message and decrements NPRED(j) for each j such that Must_Preceed(i,j)=true. After this update stage, it looks for tasks that can be invoked. We simply go over the list of tasks that haven’t been invoked yet, and look for tasks i for which NPRED(i)=0, and that are allowed to run in parallel to the tasks that are already running.

When parallelizing and algorithm using this pattern, it is important not to break the algorithm into too many tasks, because the fact that there is a master thread means that there is management overhead. Notice that the worst-case time complexity is at least O(N^2), in an efficient implementation, because we need to compute Must_Preceed(i,j) and Can_Run_In_Parallel(i,j) for all i and all j. However, if N is at most a few thousands, and the tasks are not trivial, this overhead is negligible.

“Focus-on-task” functionality

“Focus-on-task” is an added functionality that works nicely with this pattern. While the algorithm is running, the user may wish to give highest priority to a specific task #i. In such case, we should invoke only task #i and the tasks that are required for it to run (i.e. the tree of predecessors). After task #i is finished, the algorithm should return to its normal behavior and invoke all tasks until completion.

The implementation of this functionality is a relatively simple extension to the above: When we want to focus on a specific task, we send the master thread a special message (using PostThreadMessage), that instructs it to focus on task #i. Then, the master thread computes all predecessors of task #i, and continues as usual, except that the only tasks that belong to that tree are allowed to be invoked – until task #i is invoked (in which case we return to normal mode).

Implementation without a master thread (only a pool of worker threads is needed)

Suppose we only have the Must_Preceed constraint (i.e, we’re dropping Can_Run_In_Parallel). We can implement this pattern without a master thread. You’ll need a thread pool in which you can spawn tasks dynamically (unlike my thread_team example, in which you can only spawn a predefined collection of tasks).

You start by spawning all tasks that have no predecessors. There are now a number of tasks running in worker threads. Each task needs to be wrapped by a function that performs some the following operation after the task is complete: On the worker thread, when a task is finished, we decrement NPRED(j) for each j such that Must_Preceed(i,j)=true. If NPRED(j) reaches zero, we spawn task #j.

This is a simpler approach, but it doesn’t allow the “Focus-on-task” functionality, and doesn’t accept constraints of type Can_Run_In_Parallel.

Next Page »

Create a free website or blog at