Adi Levin's Blog for programmers

July 8, 2009

Thread Local Storage

Filed under: Multithreading — Adi Levin @ 9:13 pm

Threads of the same process share the process address space. This means that when they read from a given address all threads see the same value. So, static variables and variables on the heap are shared by threads, whereas data on the stack (such as automatic variables and function parameters) is local to each thread.
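To make the distinction concrete, here is a minimal sketch in portable C++ (std::thread rather than the Win32 thread API). The names g_shared_counter, count_on_this_thread and run_counters are made up for this illustration. The static counter is a single memory location updated by every thread, while each thread gets its own local_counter on its own stack.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// One counter for the whole process: every thread updates the same memory.
// std::atomic is used so the concurrent increments are not a data race.
static std::atomic<int> g_shared_counter{0};

// Each call gets its own `local_counter` on the calling thread's stack.
// Returns the final value of that thread-private counter.
int count_on_this_thread(int iterations) {
    int local_counter = 0;
    for (int i = 0; i < iterations; ++i) {
        ++local_counter;       // private to this thread
        ++g_shared_counter;    // visible to all threads
    }
    return local_counter;
}

// Runs `num_threads` threads and returns each thread's local result.
std::vector<int> run_counters(int num_threads, int iterations) {
    std::vector<int> local_results(num_threads, 0);
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&local_results, t, iterations] {
            local_results[t] = count_on_this_thread(iterations);
        });
    }
    for (auto& th : threads) th.join();
    return local_results;
}
```

With 4 threads and 1000 iterations, every entry of the returned vector is 1000, while g_shared_counter ends at 4000.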

In some cases, we need a separate copy of a variable for each thread, but not on the stack (since data on the stack is lost when the function returns). This is what TLS (Thread Local Storage) provides.

Motivation

Here’s an example for motivation. Suppose you are writing a graphics application that needs to render a given set of N 3D objects to the screen, and you want the rendering to be multithreaded for the best performance. One way to do it is to define N tasks, where each task renders one object, and then use a team of threads to run the tasks concurrently. Suppose the number of processors in our machine is small (2 or 4), and much smaller than N.

To avoid a race condition, parallel threads should not render to the same target image simultaneously. An efficient solution is to have every thread render to its own (private) target image. After all the threads have finished, we compute the final image by merging the private images of the different threads.

It is better to keep the data thread-local than task-local, because the tasks can be fine-grained. By that I mean that we wouldn’t want a separate target image for each of the N objects: merging N images costs far more than merging one image per thread, especially if N is large and each object covers only a small part of the screen.

Copy instead of Synchronize

More generally, when threads share data for reading and writing, you can choose between synchronizing the access and making thread-local copies of the data. Synchronization (by means of mutexes, semaphores, etc.) is sometimes less efficient because threads block while they wait for each other. It may be more efficient to make a thread-local copy of the entire data structure at the beginning of the algorithm, let each thread work independently, and then combine the results from the different threads when they finish. This way, there is only one synchronization point, at the end. Of course, you pay by using more memory.
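The idea can be sketched without any TLS API at all. The example below is an illustration in portable C++, not the rendering code discussed above, and parallel_histogram is a made-up name. Each thread fills its own private histogram, and the results are merged after the threads are joined. It assumes num_threads is at least 1 and that every value in data lies in [0, num_bins).

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Counts how often each value in [0, num_bins) occurs in `data`.
// Each worker fills a private histogram; the only synchronization is join().
std::vector<int> parallel_histogram(const std::vector<int>& data,
                                    int num_bins, int num_threads) {
    // One private result per thread, so no two threads write the same memory.
    std::vector<std::vector<int>> private_bins(
        num_threads, std::vector<int>(num_bins, 0));

    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                ++private_bins[t][data[i]];   // assumes 0 <= data[i] < num_bins
        });
    }
    for (auto& w : workers) w.join();   // the single synchronization point

    // Merge step: runs on one thread after all workers are done.
    std::vector<int> merged(num_bins, 0);
    for (const auto& bins : private_bins)
        for (int b = 0; b < num_bins; ++b)
            merged[b] += bins[b];
    return merged;
}
```

The memory cost mentioned above shows up here as num_threads histograms instead of one. In exchange, the inner loop needs no lock.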

Usage

Windows has built-in support for thread-local storage. The most useful API functions for TLS are TlsAlloc, TlsFree, TlsGetValue and TlsSetValue.

DWORD WINAPI TlsAlloc(void);

BOOL WINAPI TlsFree(__in DWORD dwTlsIndex);

LPVOID WINAPI TlsGetValue(__in DWORD dwTlsIndex);

BOOL WINAPI TlsSetValue(__in DWORD dwTlsIndex, __in_opt LPVOID lpTlsValue);

TlsAlloc allocates a slot for thread-local storage. It returns an index that is then used for accessing the thread-local value. Only one thread needs to allocate the index; the other threads can then use it. When a slot is allocated, its value is initialized to zero for every thread. TlsFree deallocates the slot.

TlsGetValue and TlsSetValue access the value of the TLS variable, as seen by the calling thread. Initially, the value is set to NULL (zero) for all threads. The value is a pointer (void*), which typically points to a dynamically-allocated structure. In the particular example that was discussed above, this will be a pointer to the thread-local target image.
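A minimal sketch of the four calls working together might look like the following. TargetImage and run_workers are hypothetical names chosen for this illustration, and the threads are created with std::thread for brevity. The #else branch is only an assumption-laden stand-in (built on POSIX thread-specific keys) so that the sketch can also be compiled outside Windows; on Windows the real functions from windows.h are used.

```cpp
#include <thread>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#else
// ASSUMPTION: stand-in so the sketch also builds off Windows. It maps the four
// Tls* calls onto POSIX thread-specific keys and is not part of the Win32 API.
#include <pthread.h>
typedef unsigned int DWORD; typedef void* LPVOID; typedef int BOOL;
const DWORD TLS_OUT_OF_INDEXES = 0xFFFFFFFF;
inline DWORD TlsAlloc() {
    pthread_key_t key;
    return pthread_key_create(&key, nullptr) == 0 ? (DWORD)key : TLS_OUT_OF_INDEXES;
}
inline BOOL TlsFree(DWORD i) { return pthread_key_delete((pthread_key_t)i) == 0; }
inline LPVOID TlsGetValue(DWORD i) { return pthread_getspecific((pthread_key_t)i); }
inline BOOL TlsSetValue(DWORD i, LPVOID v) { return pthread_setspecific((pthread_key_t)i, v) == 0; }
#endif

// Hypothetical per-thread data; in the rendering example this would be the
// thread's private target image.
struct TargetImage {
    int thread_number;
    std::vector<unsigned char> pixels;
};

// Runs `num_threads` workers that each store a private TargetImage in one
// shared TLS slot. Returns how many workers read back their own image,
// or -1 if no slot could be allocated.
int run_workers(int num_threads) {
    const DWORD slot = TlsAlloc();              // one thread allocates the slot
    if (slot == TLS_OUT_OF_INDEXES) return -1;

    std::vector<int> ok(num_threads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([slot, t, &ok] {
            // A new thread sees NULL in the slot until it sets a value.
            const bool starts_null = (TlsGetValue(slot) == nullptr);
            TargetImage* image = new TargetImage{t, std::vector<unsigned char>(64)};
            TlsSetValue(slot, image);
            // Any function running on this thread can now fetch the pointer.
            TargetImage* mine = static_cast<TargetImage*>(TlsGetValue(slot));
            ok[t] = (starts_null && mine == image && mine->thread_number == t);
            TlsSetValue(slot, nullptr);
            delete image;                       // each thread frees its own data
        });
    }
    for (auto& w : workers) w.join();
    TlsFree(slot);                              // once, after all threads finish

    int count = 0;
    for (int v : ok) count += v;
    return count;
}
```

All threads use the same slot index, yet each TlsGetValue call returns only the pointer that the calling thread stored.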

Capacity limitations

TlsAlloc returns TLS_OUT_OF_INDEXES if there are no more available slots. Notice that in Win32 the number of available slots cannot exceed 1088 per process, and the actual limit can be lower. The constant TLS_MINIMUM_AVAILABLE (64) defines the minimum number of TLS slots guaranteed to be available in each process.

Because at most 1088 TLS slots are available, you can’t afford to spend them carelessly. When you have different kinds of data that you want to make thread-local, try to consolidate them into one structure, and use one TLS slot for all of them. Each thread is responsible for allocating and freeing the structure that its TLS value points to, but only one thread should call TlsFree, after all threads have finished using the TLS slot.
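One possible way to arrange this is sketched below. ThreadData, tls_startup, tls_shutdown, thread_data, release_thread_data and run_demo are all hypothetical names, and the fields of ThreadData are placeholders. A single slot holds a pointer to a structure that bundles everything a thread needs, and the structure is allocated lazily the first time a thread asks for it. As before, the #else branch is only a stand-in so the sketch compiles outside Windows.

```cpp
#include <thread>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#else
// ASSUMPTION: stand-in so the sketch also builds off Windows. It maps the four
// Tls* calls onto POSIX thread-specific keys and is not part of the Win32 API.
#include <pthread.h>
typedef unsigned int DWORD; typedef void* LPVOID; typedef int BOOL;
const DWORD TLS_OUT_OF_INDEXES = 0xFFFFFFFF;
inline DWORD TlsAlloc() {
    pthread_key_t key;
    return pthread_key_create(&key, nullptr) == 0 ? (DWORD)key : TLS_OUT_OF_INDEXES;
}
inline BOOL TlsFree(DWORD i) { return pthread_key_delete((pthread_key_t)i) == 0; }
inline LPVOID TlsGetValue(DWORD i) { return pthread_getspecific((pthread_key_t)i); }
inline BOOL TlsSetValue(DWORD i, LPVOID v) { return pthread_setspecific((pthread_key_t)i, v) == 0; }
#endif

// Hypothetical bundle of everything this module wants to keep per thread.
struct ThreadData {
    std::vector<unsigned char> target_image;  // e.g. private render target
    std::vector<float> scratch;               // e.g. temporary work buffer
    int objects_rendered = 0;                 // e.g. per-thread statistics
};

static DWORD g_slot = TLS_OUT_OF_INDEXES;     // the single slot for ThreadData

// Called once, by one thread, before any worker uses the slot.
bool tls_startup() { g_slot = TlsAlloc(); return g_slot != TLS_OUT_OF_INDEXES; }

// Called once, by one thread, after all workers have released their data.
void tls_shutdown() { TlsFree(g_slot); g_slot = TLS_OUT_OF_INDEXES; }

// Returns the calling thread's ThreadData, allocating it on first use.
ThreadData* thread_data() {
    ThreadData* d = static_cast<ThreadData*>(TlsGetValue(g_slot));
    if (d == nullptr) {
        d = new ThreadData();
        TlsSetValue(g_slot, d);
    }
    return d;
}

// Each thread frees its own structure before it exits.
void release_thread_data() {
    delete static_cast<ThreadData*>(TlsGetValue(g_slot));
    TlsSetValue(g_slot, nullptr);
}

// Demo driver: each worker "renders" `per_thread` objects and reports its count.
std::vector<int> run_demo(int num_threads, int per_thread) {
    std::vector<int> counts(num_threads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counts, t, per_thread] {
            for (int i = 0; i < per_thread; ++i)
                ++thread_data()->objects_rendered;   // no lock: data is private
            counts[t] = thread_data()->objects_rendered;
            release_thread_data();
        });
    }
    for (auto& w : workers) w.join();
    return counts;
}
```

Adding another kind of thread-local data then means adding a field to ThreadData rather than calling TlsAlloc again.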

MSDN recommends using TLS in dynamic-link libraries by allocating and freeing the TLS slot in the DLL’s entry point (the function DllMain). TlsAlloc is called when the DLL attaches to a process (DLL_PROCESS_ATTACH), and TlsFree is called when the DLL detaches from the process (DLL_PROCESS_DETACH). The allocation of the structure that the TLS value points to, and the call to TlsSetValue, happen when a thread attaches to the DLL (DLL_THREAD_ATTACH), and deallocation happens when a thread detaches from it (DLL_THREAD_DETACH).

This approach means that you use one TLS slot per DLL, which keeps the number of slots low, so that we don’t risk failures of TlsAlloc. The disadvantage is that it doesn’t sit well with object-oriented programming. For every new object or algorithm that requires thread-local data, we have to update the data type that the TLS slot points to.
