PAR Class 7, Mon 2019-02-04

1   OpenMP lessons

What you should have learned from the OpenMP lectures:

  1. How to use a well-designed and widely used API that is useful on almost all current CPUs.
  2. Shared memory: the most common parallel architecture.
  3. How to structure a program to be parallelizable.
  4. Issues in parallel programs, like nondeterminism, race conditions, critical regions, etc.

2   Types of memory allocation

Here's a brief overview of my understanding of the various places where you can allocate memory in a program.

  1. Static. Define a fixed-size global array. The space is laid out at compile time, so accesses might perhaps be faster. Global vars with non-default initial values increase the executable file size. If they're large enough, you need to use the compiler option -mcmodel=medium or -mcmodel=large. I don't know the effect on the program's size or speed.
  2. Stack. Define local arrays that are created and freed as the routine is entered and exited. Their addresses relative to the base of the call frame may be constant. The default stack size is 8MB. You can increase this with the command ulimit -s, or in the program as shown in stacksize.cc. I believe that in OpenMP, the max stack size may be allocated when each thread is created, so a really big stack size might have a penalty.
  3. Heap. You use new and delete. Variables are constructed whenever you want. The more objects on the heap, the more time each new or delete takes. If you have lots of objects, consider using placement new or creating an array of them.

I like to use static, then stack, and heap only when necessary.
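
A minimal sketch of the three kinds, with hypothetical names:

const int N = 1000;

float a[N][N];                   // static: space is laid out in the executable

int main() {
    float b[N];                  // stack: created and freed as main is entered and exited
    float *c = new float[N];     // heap: lives until you delete it
    b[0] = a[0][0];  c[0] = b[0];   // touch them so the compiler keeps them
    delete[] c;
    return 0;
}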

Google's allocator, tcmalloc, is noticeably better than the default one. To use it, link your programs with -ltcmalloc. You can often use it on an existing executable foo thus:

LD_PRELOAD="/usr/lib/libtcmalloc.so" foo

I found it to save 15% to 30% in time.
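
For example, to link it in directly (foo.cc is a hypothetical source file):

g++ -O3 -fopenmp foo.cc -o foo -ltcmalloc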

Another memory concern is speed. Parallel has a NUMA (Non-Uniform Memory Architecture). It has two 14-core Xeons, each with 128GB of main memory attached. Although all 256GB are in a common address space, accessing memory attached to the Xeon that the thread is running on is faster.

The following is what I think based on some research, but may be wrong: a 4KB page of memory is assigned to a specific Xeon's memory when it is first written (not when it is reserved). So, the pages of a large array may end up spread across both Xeons. This can be used to optimize things. This gets more fun with 8-processor systems.
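
If that first-touch behavior holds, then initializing a big array with the same parallel pattern that will later read it should place each page near the thread that uses it. A minimal sketch of the idea (the size and names are mine; schedule(static) makes the same threads touch the same chunks in both loops):

#include <omp.h>
#include <cstdio>

int main() {
    const long n = 100000000;
    double *x = new double[n];   // address space reserved, but pages not yet touched

    // First touch: each thread writes its own chunk, so those pages should
    // end up in the memory attached to that thread's Xeon.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++) x[i] = 1.0;

    // The same static schedule here means threads mostly read local pages.
    double sum = 0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < n; i++) sum += x[i];

    printf("%f\n", sum);
    delete[] x;
    return 0;
}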

All that is separate from cache issues.

You can also pin your OpenMP threads to specific cores. This affects speed in ways I don't fully understand; the tradeoff is resource sharing vs. conflicts.
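
The portable way to do that is OpenMP's OMP_PROC_BIND and OMP_PLACES environment variables (./foo is a placeholder executable):

OMP_PLACES=cores OMP_PROC_BIND=close  ./foo   # pack threads onto adjacent cores
OMP_PLACES=cores OMP_PROC_BIND=spread ./foo   # spread threads across both Xeons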

3   Optional Homework - bring your answers and discuss next week

  1. Write a program to multiply two 100x100 matrices. Do it the conventional way; don't use anything fancy like Strassen's algorithm. Now see how much improvement you can get with OpenMP. Measure only the elapsed time for the multiplication, not for the matrix initialization. (Sketches for items 1-3 appear after this homework list.)

    Report these execution times.

    1. Without OpenMP enabled (don't use -fopenmp; comment out the pragmas).
    2. With OpenMP, using only 1 thread.
    3. Using 2, 4, 8, 16, 32, and 64 threads.
  2. Write programs to test the effect of the reduction clause:

    1. Create an array of 1,000,000,000 floats and fill it with pseudorandom numbers from 0 to 1.
    2. Do the following tests with 1, 2, 4, 8, 16, and 32 threads.

    Programs to write and test:

    1. Sum it with a simple for loop. This will give a wrong answer with more than 1 thread, but is fast.
    2. Sum it with the subtotal variable protected with an atomic pragma.
    3. Sum it with the subtotal variable protected with a critical pragma.
    4. Sum it with a loop using the reduction clause.
  3. Devise a test program to estimate the time to execute a task pragma. You might start with taskfib.cc.

  4. Sometimes parallelizing a program can increase its elapsed time. Try to create such an example, with 2 threads being slower than 1.
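
Here are minimal sketches for homework items 1-3; the sizes, names, and output formats are my own choices, not part of the assignment.

For item 1, the OpenMP-enabled, timed multiplication might look roughly like this:

#include <omp.h>
#include <cstdio>

const int N = 100;
float a[N][N], b[N][N], c[N][N];

int main() {
    for (int i = 0; i < N; i++)                  // initialization; not timed
        for (int j = 0; j < N; j++) { a[i][j] = i + j; b[i][j] = i - j; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float s = 0;
            for (int k = 0; k < N; k++) s += a[i][k] * b[k][j];
            c[i][j] = s;
        }
    printf("%d threads: %g secs\n", omp_get_max_threads(), omp_get_wtime() - t0);
    return 0;
}

For item 2, the four summation variants might look roughly like this (on a smaller array here; the assignment's 10^9 floats need about 4GB):

#include <omp.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const long n = 100000000;
    float *x = new float[n];
    for (long i = 0; i < n; i++) x[i] = drand48();

    double sum = 0;

    // 1. Unprotected: fast, but a race, so wrong with more than 1 thread.
    #pragma omp parallel for
    for (long i = 0; i < n; i++) sum += x[i];

    // 2. atomic: correct, but serializes every addition.
    sum = 0;
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        #pragma omp atomic
        sum += x[i];
    }

    // 3. critical: correct, with even more overhead per addition.
    sum = 0;
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        #pragma omp critical
        sum += x[i];
    }

    // 4. reduction: each thread keeps a private subtotal, combined at the end.
    sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) sum += x[i];

    printf("%f\n", sum);
    delete[] x;
    return 0;
}

For item 3, one rough approach is to time the creation of many empty tasks:

#include <omp.h>
#include <cstdio>

int main() {
    const int n = 1000000;
    double t0 = omp_get_wtime();
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < n; i++) {
        #pragma omp task
        { }                      // empty task
    }
    double t = omp_get_wtime() - t0;   // includes the barrier that waits for all tasks
    printf("%g secs total, about %g secs per task\n", t, t / n);
    return 0;
}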

4   NVidia device architecture, start of CUDA

  1. The above shared memory model hits a wall; CUDA handles the other side of the wall.
  2. See Stanford's parallel course notes, which they've made freely available. (Thanks.)
    1. They are very well done.
    2. They are obsolete in some places, which I'll mention.
    3. However, a lot of newer CUDA material is also obsolete.
    4. Nevertheless, the principles are still mostly valid.