Adi Levin's Blog for programmers

July 24, 2009

Multithreading Patterns #1 – Fork-Join

Filed under: Multithreading — Adi Levin @ 11:33 am

The purpose of the Fork-Join pattern is to parallelize a part of an algorithm through task-parallelism.

“Fork” means that the calling thread asks other threads to perform tasks, instead of performing the work sequentially itself.

“Join” means that the calling thread waits until all tasks have been accomplished by the other threads. In some cases the “Join” operation is more complicated – it requires merging the results of the different threads.

Here are some typical applications of the pattern:

1. Parallel “for” loop. Each iteration of the loop becomes a task, or a range of iterations of the loop become a task. For an example, see my post on Asynchronous Procedure Call.

2. Parallel reduction – for example, computing the sum of an array. Break the array into N parts. Each of the N tasks is responsible for summing up one part of the array. The “Join” operation means waiting for all of the tasks to complete, and then summing up the N partial results.

3. Parallel activation of two independent parts of an algorithm.

This is a very useful pattern, and it does not require broad changes to the software. It parallelizes a specific segment of code, without requiring changes in other parts of the code.

July 19, 2009

Debugging tips and tricks

Filed under: programming — Adi Levin @ 1:06 pm

What do you do when you encounter a bug that is really hard to debug? Here are a few tips & tricks:

Breakpoints with counter

If your program consistently throws an exception at a specific line, you want to break before the crash happens and monitor how the program behaves. A plain breakpoint is not enough, because it will stop every time you reach that line, and not necessarily in the iteration that throws the exception. To know where to stop, place a breakpoint with a counter just before the suspicious line, set the counter to a high number, and wait for the program to crash. The debugger will then show you how many times you passed through the counter, and you can use that number as the hit count of your breakpoint. In the next run, the breakpoint will stop exactly where you wanted.

Breakpoints with tracing

In Visual Studio, you can instruct the debugger to print values of certain expressions as well as the thread ID and even the call-stack, whenever it reaches a breakpoint. Right-click the breakpoint and choose “when hit…”. The result will appear in the “output” window in the IDE.

Divide and conquer

Suppose your program consistently throws an exception at a specific line, because of corrupted data (for example, using a pointer that has been deleted). Suppose that you don’t even know the reason, but you know that somewhere before that line, the data got corrupted. So, the reason for the bug could be far away from the line where the exception is thrown. To detect which line of code caused the corruption, use a divide-and-conquer approach, if possible.

Let’s regard the program as a series of commands, and suppose that the exception is thrown at line 10,000. You know that the reason for the bug is somewhere between line 1 and line 10,000. Start by commenting-out lines 5,000-9,999 (assuming it is legal to do so). If the exception is still thrown, you know that the reason for the bug is between line 1 and 4,999. Otherwise, it is between 5,000 and 10,000. Uncomment, and continue by commenting-out half of the suspicious code, until the suspicious area is small enough.

Minidump files in Visual Studio

If your program crashes from time to time, but you cannot reproduce the bug in your debugger, it is important to get as much information as possible from each crash. You need to write a side application that monitors your application for exceptions, and writes the call-stack and other needed information when an exception occurs.

The best way to do this is to write a program that attaches to your application as a debugger, using the DebugActiveProcess command (see MSDN for documentation). In this debugger program, run a function that constantly waits for debug events, using WaitForDebugEvent. A debug event can be a thrown exception, a thread being created or deleted, and other events. When you get a debug event, you can check if it is an exception, and get the type of exception (e.g. access violation or division by zero) and the ID of the thread from which the exception was thrown. You should output this information to a text file. You should also output a dump-file that will include the call-stacks of all running threads, using the command MiniDumpWriteDump.
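A minimal sketch of such a monitor might look like the following. Error handling is omitted, and the dump file name and dump type (MiniDumpNormal) are illustrative choices, not part of the original post.

```cpp
#include <windows.h>
#include <dbghelp.h>
#include <stdio.h>
#pragma comment(lib, "dbghelp.lib")

// Attach to the target process as a debugger, log every exception,
// and write a minidump (call-stacks of all threads) when one occurs.
void MonitorProcess(DWORD pid)
{
    if (!DebugActiveProcess(pid))
        return;
    for (;;) {
        DEBUG_EVENT ev;
        if (!WaitForDebugEvent(&ev, INFINITE))
            break;
        DWORD cont = DBG_CONTINUE;
        if (ev.dwDebugEventCode == EXCEPTION_DEBUG_EVENT) {
            printf("exception 0x%08lx in thread %lu\n",
                   ev.u.Exception.ExceptionRecord.ExceptionCode,
                   ev.dwThreadId);
            HANDLE hProc = OpenProcess(PROCESS_ALL_ACCESS, FALSE, pid);
            HANDLE hFile = CreateFileA("crash.dmp", GENERIC_WRITE, 0, NULL,
                                       CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL,
                                       NULL);
            MiniDumpWriteDump(hProc, pid, hFile, MiniDumpNormal,
                              NULL, NULL, NULL);
            CloseHandle(hFile);
            CloseHandle(hProc);
            cont = DBG_EXCEPTION_NOT_HANDLED;  // let the app's handlers run
        }
        if (ev.dwDebugEventCode == EXIT_PROCESS_DEBUG_EVENT)
            break;
        ContinueDebugEvent(ev.dwProcessId, ev.dwThreadId, cont);
    }
}
```

A real monitor would also distinguish first-chance from second-chance exceptions and name each dump file uniquely.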

The result is a “.dmp” file that you can load into Visual Studio, and see the running threads and call-stacks at the time the exception was thrown. To do this, you need to place the dmp file near the executable, and make sure you have PDB files for the relevant executables (EXE and DLL) that you wish to debug. The PDB files should be next to the DLLs. You also need to have the source-code available. Of course, source-code and PDB files should be exactly the ones that were used and created when building the specific version of the application for which the dmp file was created.

If your application does not throw a lot of exceptions on a regular basis, then running such a process that monitors exceptions in your application does not slow it down. It is even safe to distribute your application with this process attached. This way, when a customer complains about a crash, you can ask them to send you the dmp file (or do it automatically), along with any other information that your process generates, and you’ll be able to see exactly where your program crashed.

It is important to configure Visual Studio to save debug information (PDB files) even in the release version, for this purpose. It doesn’t slow down the application. You don’t need to distribute the PDB files – you only need to store them on your side. You’ll need one copy of the entire collection of PDB files for each build.

Debugging with beeps

Breakpoints are sometimes not useful, because they block execution, and thereby change the timing of the application. I find it useful to place “Beep” commands at certain places in the code, so that I hear a beep, at a different tone, every time certain lines of code are executed. This way I use my hearing to monitor the execution of my application.

Writing a log-file

For crashes that are hard to reproduce, it is sometimes necessary to keep a detailed log of the operations that preceded the crash. Writing a log-file isn’t trivial, especially in a multithreaded application. You need your logger to be able to write lines to the log-file from different threads, without slowing down the application. Simply writing a line to an open file is not good enough, because when the application crashes, the file will not be closed properly, so the log-file will be missing the last lines – which are the most important ones for debugging.

A good way to write to a log file is via a second process, called a “logger process”. In the application that you wish to debug, perform logging by posting messages to the logger process, using the non-blocking command PostMessage. Use the three parameters of the message – the message identifier, wparam and lparam – to send the important information that you want to write to the log file. In the logger process, the window procedure or message loop should handle the incoming messages by writing the appropriate data to the log file.

The data that should appear in a log-file includes the time, thread ID, and the string you want to write. There should be a mapping between message identifier and the string that appears in the log-file. To transfer the mapping to the logger process, you can use WM_COPYDATA.
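As a sketch of the two sides of this scheme, the fragment below uses a single illustrative message identifier and finds the logger window by title; in the scheme described above there would be one identifier per log string, and the window title, file name and formats are my assumptions.

```cpp
#include <windows.h>
#include <stdio.h>

const UINT WM_LOGEVENT = WM_APP + 1;   // illustrative message identifier

// --- In the application being debugged: non-blocking log call ---
void LogEvent(LPARAM detail)
{
    HWND hLogger = FindWindowA(NULL, "LoggerProcessWindow");
    if (hLogger)
        PostMessageA(hLogger, WM_LOGEVENT,
                     (WPARAM)GetCurrentThreadId(), detail);
}

// --- In the logger process: the window procedure writes the line ---
LRESULT CALLBACK LoggerWndProc(HWND h, UINT msg, WPARAM wp, LPARAM lp)
{
    if (msg == WM_LOGEVENT) {
        FILE* f = fopen("app.log", "a");
        if (f) {
            SYSTEMTIME st;
            GetLocalTime(&st);
            fprintf(f, "%02u:%02u:%02u.%03u thread %u event %ld\n",
                    st.wHour, st.wMinute, st.wSecond, st.wMilliseconds,
                    (unsigned)wp, (long)lp);
            fclose(f);  // closed after every line, so nothing is lost on a crash
        }
        return 0;
    }
    return DefWindowProc(h, msg, wp, lp);
}
```

Since the file lives in the logger process, a crash of the monitored application cannot truncate it.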

If you need to pass more information than fits in the three message parameters, consider using named pipes for communicating with the logger process.

Alternatively, you can implement a logger not as a different process, but as a dedicated thread inside your application. You can use APCs (the command QueueUserAPC) to send log requests to the logger thread. A dedicated thread is more efficient than accessing the log-file from different threads, because it doesn’t require explicit synchronization (i.e. mutex, semaphore or critical section). The downside of this approach is that you have to close the log-file every time you add a line to it (because you want to close properly upon crash), and this means slowing the application.
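The logger-thread variant might be sketched as follows. The structure layout, buffer size and file name are illustrative; note that the thread handle passed to QueueUserAPC must have THREAD_SET_CONTEXT access (the handle returned by CreateThread does).

```cpp
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// A log request carries the sender's thread ID and the text.
struct LogRequest { DWORD threadId; char text[128]; };

// Runs on the logger thread each time a queued APC is delivered.
static VOID CALLBACK WriteLogApc(ULONG_PTR param)
{
    LogRequest* req = (LogRequest*)param;
    FILE* f = fopen("app.log", "a");
    if (f) {
        fprintf(f, "thread %lu: %s\n", req->threadId, req->text);
        fclose(f);  // close after every line so the log survives a crash
    }
    free(req);
}

// The logger thread sleeps alertably, so queued APCs run one by one.
static DWORD WINAPI LoggerThread(LPVOID)
{
    for (;;) SleepEx(INFINITE, TRUE);
}

// Called from any thread; non-blocking.
void Log(HANDLE hLoggerThread, const char* text)
{
    LogRequest* req = (LogRequest*)malloc(sizeof(LogRequest));
    req->threadId = GetCurrentThreadId();
    strncpy(req->text, text, sizeof(req->text) - 1);
    req->text[sizeof(req->text) - 1] = '\0';
    QueueUserAPC(WriteLogApc, hLoggerThread, (ULONG_PTR)req);
}
```

Because all APCs execute on the single logger thread, the file is never accessed from two threads at once, which is why no mutex is needed.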

Pageheap

If you suspect data corruption due to a memory buffer overrun, I propose that you run your application with the pageheap flag turned on. Pageheap is a flag that Windows can apply globally or to a specific executable. It tells Windows to change the way it does heap allocations: it allocates buffers on the heap such that they are followed by a reserved page. If your program tries to access an entry outside the valid range of a buffer, it will throw an access-violation exception, and you can use your debugger to understand why it happened.

To use it, you will need gflags.exe, which is part of the Windows Debugging Tools. You can download a 32-bit version and a 64-bit version for free. After installing the Windows Debugging Tools, run gflags.exe (short for “global flags”) to see the flags that can be turned on and off.
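For example, enabling full pageheap for one executable from the command line might look like this (assuming gflags.exe is on the PATH; "myprogram.exe" is a placeholder):

```shell
gflags /p /enable myprogram.exe /full   # turn full pageheap on for myprogram.exe
gflags /p                               # list images that have pageheap enabled
gflags /p /disable myprogram.exe        # turn it off again when done
```

The /full option places the reserved page immediately after each allocation, which catches overruns at the exact instruction that performs them.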

Turning on pageheap will slow down your application, but not as much as other memory-monitoring tools do (such as Bounds Checker and Rational Purify).

July 8, 2009

Thread Local Storage

Filed under: Multithreading — Adi Levin @ 9:13 pm

Threads of the same process share the process address space. This means that when they read from a given address all threads see the same value. So, static variables and variables on the heap are shared by threads, whereas data on the stack (such as automatic variables and function parameters) is local to each thread.

In some cases, we need to store a different version of a certain variable for each thread – but not on the stack (since data on the stack is lost after the function exits). This is provided by TLS (Thread Local Storage).

Motivation

Here’s an example, for motivation: Suppose you are writing a graphic application that needs to render a given set of N 3D objects to the screen, and you want the rendering to be multithreaded, in order to achieve highest performance. One way to do it is to define N tasks, where the job of each task is to render one object, and then use a team of threads to run the tasks concurrently. Suppose the number of processors in our machine is small (2 or 4), and much smaller than N.

To avoid a race-condition, parallel threads should not render to the same target image simultaneously. An efficient way to solve it is to have every thread render to its own (private) target image. After the threads have all finished, we compute the final image by merging the private images of the different threads.

It is better to keep the data thread-local than task-local, because the tasks can be fine-grained. By that I mean that we wouldn’t want to define a different target image for each of the N objects – this would increase the overhead of merging the images – especially if N is large and each object only covers a small segment of the screen.

Copy instead of Synchronize

More generally, when threads share data for reading and writing, you can choose between synchronizing the access and making thread-local copies of the data. Synchronization (by means of mutexes, semaphores etc.) is sometimes less efficient because it blocks calculation. It may be more efficient to make a thread-local copy of the entire data structure at the beginning of the algorithm, let each thread work independently, and then join the results from the different threads when they end. This way, there is only one synchronization point, at the end. Of course, you pay by using more memory.

The Win32 TLS API

Windows has built-in support for thread-local storage. The most useful API functions for TLS are TlsAlloc, TlsFree, TlsGetValue and TlsSetValue.

DWORD WINAPI TlsAlloc(void);

BOOL WINAPI TlsFree(__in  DWORD dwTlsIndex);

LPVOID WINAPI TlsGetValue(__in  DWORD dwTlsIndex);

BOOL WINAPI TlsSetValue(__in DWORD dwTlsIndex, __in_opt  LPVOID lpTlsValue); 

TlsAlloc is used to allocate a slot for thread-local storage. It returns a number that will be used for accessing a thread-local value. Only one thread needs to allocate the index, and then other threads can use it. When allocating a slot, its values for all threads are set to zero. TlsFree is used to deallocate the slot.

TlsGetValue and TlsSetValue access the value of the TLS variable, as seen by the calling thread. Initially, the value is set to NULL (zero) for all threads. The value is a pointer (void*), which typically points to a dynamically-allocated structure. In the particular example that was discussed above, this will be a pointer to the thread-local target image.
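Putting the four functions together, a typical usage pattern might look like this sketch (the ThreadData structure is a stand-in for whatever per-thread state you need, such as the private target image from the rendering example):

```cpp
#include <windows.h>

// Illustrative per-thread data; in the rendering example this would
// hold the thread's private target image.
struct ThreadData { int objectsRendered; };

static DWORD g_tlsIndex = TLS_OUT_OF_INDEXES;

// Called once, by one thread, before the worker threads start.
bool InitTlsSlot()
{
    g_tlsIndex = TlsAlloc();
    return g_tlsIndex != TLS_OUT_OF_INDEXES;
}

// Called once, after all threads have finished using the slot.
void FreeTlsSlot()
{
    TlsFree(g_tlsIndex);
}

// Called by any thread: fetch its private copy, allocating on first
// access (the slot's value starts out as NULL for every thread).
ThreadData* GetThreadData()
{
    ThreadData* d = (ThreadData*)TlsGetValue(g_tlsIndex);
    if (d == NULL) {
        d = new ThreadData{};
        TlsSetValue(g_tlsIndex, d);
    }
    return d;
}

// Each thread must delete its own ThreadData before it exits.
```

Note that the index itself is shared, while the value stored at that index is per-thread.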

Capacity limitations

TlsAlloc returns TLS_OUT_OF_INDEXES if there are no more available slots. Notice that in Win32 the number of available slots cannot exceed 1088, and the actual limit can be smaller. The constant TLS_MINIMUM_AVAILABLE defines the minimum number of TLS slots guaranteed to be available in each process.

Because at most 1088 TLS slots are available, you can’t afford to spend them carelessly. When you have different kinds of data that you want to make thread-local, try to consolidate them into one structure, and use one TLS slot for all of them. Each thread is responsible for allocating and freeing the structure that is pointed to by its TLS value, but only one thread should call TlsFree, after all threads have finished using the TLS slot.

MSDN recommends using TLS in Dynamic-Link Libraries by allocating and freeing the TLS slot in the DLL’s entry point (the function DllMain). TlsAlloc is called when the DLL attaches to a process, and TlsFree is called when the DLL detaches from the process. The allocation of the structure pointed to by the TLS value, and the call to TlsSetValue, happen when the DLL attaches to a thread, and deallocation happens when the DLL detaches from a thread.

This approach means that you use one TLS slot per DLL, which is good for keeping the number of slots low, so that we won’t risk failures of TlsAlloc. The disadvantage is that it doesn’t sit well with object-oriented programming: for every new object or algorithm that requires thread-local data, we have to update the data-type that is pointed to by the TLS value.
