Adi Levin's Blog for programmers

June 7, 2009

CreateThread: An example

Filed under: Multithreading — Adi Levin @ 9:27 pm

The most basic way to create a thread is via the CreateThread function. Let’s examine a short program that uses CreateThread. Even though it is very simple, it raises important questions. To run the example, simply copy-and-paste it into the main.cpp file of a console application.

The example program

#include <windows.h>
#include <iostream>
#include <intrin.h>

#define N 30000000
#define K 2

static double calc_z(double x, double y) {
    double z = 0;
    for(int i=0;i<K;++i)
        z += x*i + y - i;
    return z;
}
 
struct range
{
    range(const double* in_x, const double* in_y, double* out_z, int in_start_index, int in_end_index, int in_step) :
        x(in_x), y(in_y), z(out_z),
        start_index(in_start_index), end_index(in_end_index), step(in_step) {}     
    const double* x;
    const double* y;
    double* z;
    int start_index;
    int end_index;
    int step;
};
 
static DWORD WINAPI calcmany(void *param)
{
    range* r = (range*)param;
    for(int i=r->start_index; i<r->end_index; i += r->step)
        r->z[i] = calc_z(r->x[i],r->y[i]);
    return 0;
}
 
void compute_z_values_serial(const double* in_x, const double* in_y, int in_num, double* out_z)
{  
    range r(in_x, in_y, out_z, 0, in_num, 1);
    calcmany(&r);
}
 
void compute_z_values_parallel_1(const double* in_x, const double* in_y, int in_num, double* out_z)
{
    HANDLE h[2];
    DWORD thread_id;
    range r0(in_x, in_y, out_z, 0, in_num/2, 1);
    h[0] = CreateThread(NULL,1<<24,calcmany,&r0,0,&thread_id);

    range r1(in_x, in_y, out_z, in_num/2, in_num, 1);
    h[1] = CreateThread(NULL,1<<24,calcmany,&r1,0,&thread_id);

    WaitForMultipleObjects(2,h,TRUE,INFINITE);
    CloseHandle(h[0]);  // thread handles must be closed, or they leak
    CloseHandle(h[1]);
}
 
void compute_z_values_parallel_2(const double* in_x, const double* in_y, int in_num, double* out_z)
{
    HANDLE h[2];
    DWORD thread_id;
    range r0(in_x, in_y, out_z, 0, in_num, 2);
    h[0] = CreateThread(NULL,1<<24,calcmany,&r0,0,&thread_id);

    range r1(in_x, in_y, out_z, 1, in_num, 2);
    h[1] = CreateThread(NULL,1<<24,calcmany,&r1,0,&thread_id);

    WaitForMultipleObjects(2,h,TRUE,INFINITE);
    CloseHandle(h[0]);  // thread handles must be closed, or they leak
    CloseHandle(h[1]);
}
 
int main()
{
    double* x = new double[N];
    double* y = new double[N];
    double* z = new double[N];
    for(int i=0;i<N;++i) {
        x[i] = i*1.0;
        y[i] = i*3.0;
    }

    // prints the function name, then the elapsed time in billions of CPU cycles
    #define TIME(FUNC) { std::cout << #FUNC << "\n"; __int64 t0 = __rdtsc(); FUNC(x,y,N,z); __int64 t1 = __rdtsc(); std::cout << (t1-t0)*1.e-9 << "\n"; }

    TIME(compute_z_values_serial);
    TIME(compute_z_values_parallel_1);
    TIME(compute_z_values_parallel_2);

    std::cout << "Done.\n";
    delete [] x;
    delete [] y;
    delete [] z;
    return 0;
}

What does this program do?

This program compares the performance of three functions that all do the same job. The function compute_z_values_serial does not use parallelization: it simply calls z[i]=calc_z(x[i],y[i]) for i ranging from 0 to N-1. The function compute_z_values_parallel_1 does the same thing, but divides the work between two threads, using two calls to CreateThread: the first thread computes z[i] for i=0:(N/2-1), and the second thread computes z[i] for i=N/2:(N-1). The function compute_z_values_parallel_2 also divides the work between two threads, but in a different way: the first thread computes z[i] for even values of i, and the second thread computes z[i] for odd values of i.

The function compute_z_values_parallel_1 works as follows: it fills a range structure denoting the range 0:(N/2-1), and then creates a thread that runs the function calcmany with that range as its parameter. Similarly, a second range structure, denoting the other half of the range, is defined, and a second thread is created that also runs calcmany, on that range. h[0] and h[1] store the handles of the newly created threads. Finally, we call WaitForMultipleObjects with h[0] and h[1], which causes the primary thread to enter a waiting state until the two worker threads are finished.

Performance analysis: false sharing

The performance depends on the two parameters K and N, #defined at the beginning of the program. K controls the cost of a single call to calc_z, and N controls the number of x,y pairs for which calc_z is called. If K is large, the parallelization works pretty well on a dual-core machine, because the major cost of the program is arithmetic computation, and that is divided between the two computational cores. However, when K is small, the computation doesn’t take much time, and memory access becomes the bottleneck. As you play with different values of N, you’ll notice that for some values of N, compute_z_values_parallel_2 performs really badly, even worse than the serial version.

How come the parallel version of the program is slower? This is because of inefficient access to cache memory, a phenomenon known as false sharing. If one thread is currently writing to z[i], and the other thread is trying to write to z[i+1], it may be blocked, because z[i] and z[i+1] are adjacent addresses. More precisely, the problem occurs when z[i] and z[i+1] fall in the same cache line. When the CPU wants to access (read/write) a certain address, it first checks whether it is already present in that CPU’s cache. If not, it fetches the relevant cache line from main memory. A cache line can be 64 bytes long (depending on the machine). If another CPU now wants to write to an address inside that cache line, the two CPUs will necessarily block each other: the cache line is written back to RAM, and then fetched again by the second CPU.

So, even if the two threads don’t write to the same addresses, they may suffer from false sharing, because they write to adjacent addresses. This doesn’t happen in compute_z_values_parallel_1, and that is why it performs better.
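The example avoids false sharing in compute_z_values_parallel_1 by giving each thread a contiguous half of the array. When threads instead each need their own small piece of writable state, another common cure is to pad that state to a full cache line, so that no two threads’ data can share a line. A minimal sketch (not from the original program; the 64-byte line size is an assumption and varies by machine):

struct __declspec(align(64)) per_thread_sum
{
    double value;          // each instance occupies its own 64-byte cache line
};                         // MSVC pads sizeof(per_thread_sum) up to 64

per_thread_sum sums[2];    // sums[0] and sums[1] never share a cache line,
                           // so two threads can update them without false sharing

With __declspec(align(64)), the compiler rounds the structure size up to the alignment, so adjacent array elements land on different cache lines.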

Understanding the architecture of dual-core processors can help you write faster programs. An Intel dual-core processor has a cache for each processing unit (called the L1 cache) and a second level of cache (the L2 cache) shared by the two cores. The L2 cache is much larger than each of the L1 caches. When a cache line is fetched, it is first fetched into the L2 cache, and then into the L1 cache of the processing unit that asked for it. Thus, if the two cores read from adjacent addresses, this is good: the memory is fetched only once from RAM into the L2 cache (which is the expensive operation). However, if one thread writes to a certain address, and another thread reads from an adjacent address in the same cache line, this is bad (“false sharing”): it forces the computer to update the main memory, which is an expensive operation.
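If you prefer not to hard-code the cache-line size, Windows can report it. A hedged sketch using GetLogicalProcessorInformation (available since Windows XP SP3; the helper name is my own):

#include <windows.h>
#include <stdlib.h>

DWORD get_cache_line_size()
{
    DWORD len = 0;
    GetLogicalProcessorInformation(NULL, &len);   // first call only reports the needed buffer size
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION* info =
        (SYSTEM_LOGICAL_PROCESSOR_INFORMATION*)malloc(len);
    DWORD line_size = 0;
    if (GetLogicalProcessorInformation(info, &len)) {
        for(DWORD i = 0; i < len/sizeof(*info); ++i) {
            if (info[i].Relationship == RelationCache && info[i].Cache.Level == 1) {
                line_size = info[i].Cache.LineSize;  // e.g. 64 on Core 2
                break;
            }
        }
    }
    free(info);
    return line_size;
}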

In a multi-processor environment, the issue of cache hit rate becomes even more important than in single-processor programming. This is because reading from the cache does not disturb any of the other processors. In the case of a cache miss, the processor has to access global memory via the bus, which is a finite resource shared by all processors. If one processor consumes a lot of memory bandwidth, it decreases the bandwidth remaining for the other processors. Therefore, we try to design our algorithms so that they incur fewer cache misses.

Performance analysis: load-balancing

The strategy that this program uses for parallelization is a very simple one: divide the work into two tasks, schedule the tasks to run simultaneously, and wait until both are completed. In many cases, this is not a good strategy.

If one task takes more time to complete than the other, we will not use the two processing cores 100% of the time. As soon as the shorter task is finished, only one CPU is working, on the second, longer, task. Even if the two tasks take exactly the same amount of computational resources, or even perform the exact same work, it doesn’t mean that we will exploit the two CPUs 100% of the time, because the operating system does not necessarily allocate the same amount of CPU time to the two threads. Moreover, because of other processes running on our machine, one task may begin sooner than the other. We don’t have fine control over the way the operating system schedules threads.

The problem of keeping all of the CPUs busy is known as load balancing. One way to achieve a better load balance is to subdivide the work into more tasks (many more than the number of processors). Then, a thread that finishes a task doesn’t go to sleep; instead, it looks for a task that has not yet been executed, and runs it, as in the sketch below. The degree of subdivision into small tasks is called granularity. When the granularity is insufficient, we can expect load-balancing problems. However, when the granularity is too fine (too many tasks), the overhead of managing the tasks will compromise the efficiency of our program.
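To make this concrete, here is a sketch of how the example could be reorganized along these lines: the range is cut into many small chunks, and each thread repeatedly claims the next free chunk with an atomic increment. It reuses range and calcmany from the example; the names NUM_CHUNKS, g_next_chunk and worker are mine:

#define NUM_CHUNKS 64                    // many more chunks than threads

static volatile LONG g_next_chunk = -1;  // index of the last chunk claimed so far
static range* g_whole;                   // the full problem; set up before the threads start

static DWORD WINAPI worker(void*)
{
    const int chunk_size = (N + NUM_CHUNKS - 1) / NUM_CHUNKS;
    for(;;) {
        LONG c = InterlockedIncrement(&g_next_chunk);  // atomically claim the next chunk
        if (c >= NUM_CHUNKS)
            return 0;                                  // no work left; the thread retires
        int start = (int)c*chunk_size;
        int end = min(start + chunk_size, N);
        range r(g_whole->x, g_whole->y, g_whole->z, start, end, 1);
        calcmany(&r);
    }
}

The main thread would set g_whole, create the two workers with CreateThread as before, and wait on both handles. A thread that happens to run slowly simply claims fewer chunks; a fast one claims more, and both stay busy until the whole range is done.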

The cost of CreateThread

Calling CreateThread every time we want to start a task on a worker thread is not a good idea. Creating a thread and ending it involves significant overhead, which becomes important when the tasks we run are short.

To get a feel for how expensive CreateThread is, I wrote a simple program and measured the time difference between the call to CreateThread and the beginning of the thread function. I found (on an Intel Core 2 CPU, 2.13GHz machine) that it takes about 0.3 milliseconds. (This number includes not only the thread creation, but also the context switch.) In certain situations, this number is not negligible.
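The measurement program itself is not reproduced here, but the idea is simple enough to sketch: take one timestamp just before CreateThread and another as the very first statement of the thread function. A minimal version using QueryPerformanceCounter (the numbers will of course differ per machine):

#include <windows.h>
#include <iostream>

static LARGE_INTEGER g_before, g_started, g_freq;

static DWORD WINAPI empty_thread(void*)
{
    QueryPerformanceCounter(&g_started);   // first thing the new thread does
    return 0;
}

int main()
{
    QueryPerformanceFrequency(&g_freq);
    QueryPerformanceCounter(&g_before);    // timestamp just before creation
    HANDLE h = CreateThread(NULL, 0, empty_thread, NULL, 0, NULL);
    WaitForSingleObject(h, INFINITE);
    CloseHandle(h);
    double ms = 1000.0*(g_started.QuadPart - g_before.QuadPart)/g_freq.QuadPart;
    std::cout << "CreateThread-to-start latency: " << ms << " ms\n";
    return 0;
}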

From CodeGuru: “When a thread is created in a process, the system examines all of the DLL file images currently mapped into the process’s address space and calls each one’s DllMain function with a value of DLL_THREAD_ATTACH. This tells all the DLLs to perform any per-thread initialization. The newly created thread is responsible for executing the code in all of the DLLs’ DllMain functions. Only after all the DLLs have had a chance to process this notification does the system allow the new thread to begin executing its thread function”.
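A DLL that performs no per-thread initialization can opt out of these notifications with DisableThreadLibraryCalls, which shaves a little off every thread creation in the process. A minimal sketch:

BOOL WINAPI DllMain(HINSTANCE hinst, DWORD reason, LPVOID)
{
    if (reason == DLL_PROCESS_ATTACH)
        DisableThreadLibraryCalls(hinst);  // suppress DLL_THREAD_ATTACH/DETACH calls to this DLL
    return TRUE;
}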

 

The alternative to creating a thread is to invoke the function on an existing thread – either by an Asynchronous Procedure Call (APC) or by using the Windows Thread Pool.
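For example, the work can be handed to the process-wide pool with QueueUserWorkItem, which runs the function on an existing pool thread instead of creating a new one. A hedged sketch, reusing range and calcmany from the example (the event used to signal completion is my addition):

static HANDLE g_done;                    // signalled when the task finishes

static DWORD WINAPI calcmany_and_signal(void* param)
{
    calcmany(param);                     // run the original task
    SetEvent(g_done);                    // tell the waiting thread we are done
    return 0;
}

void compute_z_values_pool(const double* x, const double* y, int n, double* z)
{
    g_done = CreateEvent(NULL, FALSE, FALSE, NULL);  // auto-reset, initially non-signalled
    range r(x, y, z, 0, n, 1);
    QueueUserWorkItem(calcmany_and_signal, &r, WT_EXECUTEDEFAULT);
    WaitForSingleObject(g_done, INFINITE);
    CloseHandle(g_done);
}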

CreateThread failure

CreateThread may fail due to insufficient virtual address space. Thread creation involves allocating contiguous virtual addresses for the stack (which is 1 MB by default), and this allocation may fail. When CreateThread fails, it returns a NULL handle, and the function that was supposed to run on the thread is never invoked.

On a 32-bit system, the virtual address space available to user memory is only 2 GB, so allocations can fail. This does not happen often, but it can happen if:

1. Your program uses a lot of threads (for example, hundreds of threads).

2. Your program uses a lot of virtual memory (for example, over 1.8 GB on a 32-bit system).

3. In CreateThread you request a stack size which is larger than the default 1 MB (for example, 16 MB).

To protect against such occurrences, you should check the returned handle and deal with the failure explicitly.
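A minimal sketch of such a check, reusing calcmany and range from the example (falling back to serial execution is just one reasonable policy):

void compute_z_values_safe(const double* in_x, const double* in_y, int in_num, double* out_z)
{
    range r(in_x, in_y, out_z, 0, in_num, 1);
    HANDLE h = CreateThread(NULL, 0, calcmany, &r, 0, NULL);
    if (h == NULL) {
        calcmany(&r);                    // creation failed: do the work on this thread
    } else {
        WaitForSingleObject(h, INFINITE);
        CloseHandle(h);
    }
}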


2 Comments

  1. “false sharing” – question:
    You say – “Understanding the architecture of dual-core processors can help you write faster programs”
    Do we need to write different code for Intel and different code for AMD?

    Comment by Doron Osovlanski — September 6, 2009 @ 4:03 pm

    • In general – no. But, theoretically, if you want to squeeze the maximum performance out of a specific processor, then in some cases, knowledge of the sizes of the cache (L1 and L2) can help you. If you have an Intel dual-core machine, then you know that the two cores share the L2 cache. I don’t know the situation in AMD. But this can help you understand the cost of memory access in your algorithm, and tune your algorithm accordingly. Also, if you know that two tasks read the same data segment, you might benefit from assigning them to the same CPU (but different cores). This way they will share the L2 cache. However, personally, I never felt the need to go that far. The programs that I work on are not targeted to a specific platform.

      Comment by adilevin — September 6, 2009 @ 4:26 pm

