PAR Class 3, Wed 2018-01-31

  1. We saw that running even a short OpenMP program could be unpredictable. The reason for that was a big lesson.

    There are no guarantees about the scheduling of the various threads. There are no guarantees about fairness of resource allocation between the threads. Perhaps thread 0 finishes before thread 1 starts. Perhaps thread 1 finishes before thread 2 starts. Perhaps thread 0 executes 3 machine instructions, then thread 1 executes 1 instruction, then thread 0 executes 100 more, etc. Each time the process runs, something different may happen. One thing may happen almost all the time, to trick you.

    This happened during the first space shuttle launch in 1981. A 1-in-70 chance event prevented the computers from synchronizing. This had been observed once before during testing. However, when they couldn't get it to happen again, they ignored it.

    This interlacing of different threads can happen at the machine instruction level. The C statement K=K+3 can translate into the machine instructions

    LOAD K; ADD 3; STORE K.

    Call the two threads red and blue.

    If the instructions interlace thus:

    red: LOAD K; red: ADD 3; red: STORE K; blue: LOAD K; blue: ADD 3; blue: STORE K

    then K increases by 6. If they interlace thus:

    red: LOAD K; red: ADD 3; blue: LOAD K; blue: ADD 3; red: STORE K; blue: STORE K

    then K increases by only 3; one update is lost.

  2. This can be really hard to debug, particularly when you don't know where this is happening.

    One solution is to serialize the offending code, so that only one thread executes it at a time. In the limit, you would serialize the whole program and get no parallelism at all.

  3. OpenMP has several mechanisms to help.

  4. A critical pragma serializes the following block. There are two considerations.

    1. You lose the benefits of parallelization on that block.
    2. There is an overhead to set up a critical block that I estimate might be 100,000 instructions.
  5. An atomic pragma serializes one short C statement, which must be one of a small set of allowed operations, such as increment. It matches certain atomic machine instructions, like test-and-set. The overhead is much smaller.

  6. If every thread is summing into a common total variable, the reduction pragma causes each thread to sum into a private subtotal, and then sum the subtotals. This is very fast.

  7. Another lesson is that sometimes you can check your program's correctness with an independent computation. For the trivial example of summing \(i^2\), use the formula

    \(\sum_{i=1}^N i^2 = N(N+1)(2N+1)/6\).

    There is a lot of useful algebra if you have the time to learn it. I, ahem, learned this formula in high school.

  8. Another lesson is that even when the program gives the same answer every time, it might still be consistently wrong.

  9. Another is that merely compiling with OpenMP enabled (e.g., g++ -fopenmp) slows your program down, even if you don't use any OpenMP constructs.

  10. Another is that the only meaningful time metric is elapsed real time. One reason that CPU time is misleading is that OpenMP sometimes pauses a thread's progress by making it spin in a CPU-bound loop (busy-waiting, which is also a common technique in hardware).

  11. Also note that CPU times can vary considerably with successive executions.

  12. Also, using too many threads might increase the real time.

  13. Finally, floating point numbers have their own problems. They are an approximation of the mathematical concept called the real number field. That is defined by 11 axioms that state obvious properties, like

    A+B=B+A (commutativity) and

    A+(B+C)=(A+B)+C (associativity).

    This is covered in courses on modern algebra or abstract algebra.

    The problem is that most of these are violated, at least somewhat, with floats. E.g.,

    \(\left(10^{20}-10^{20}\right)+1=0+1=1\) but

    \(10^{20}+\left(-10^{20}+1\right)=10^{20}-10^{20}=0\).

    Therefore when threads execute in a different order, floating results might be different.

    There is no perfect solution, though using double is a start.

    1. On modern CPUs, double is just as fast as float.

    2. However it's slower on GPUs.

      How much slower can vary. Nvidia's Maxwell line spent very little real estate on double precision, so it was very slow. In contrast, Maxwell added a half-precision float, for implementing neural nets. Apparently neural nets use very few significant digits.

      Nvidia's newer Pascal line reverted this design choice and spends more real estate on implementing double. parallel.ecse's GPU is Pascal.

    3. It also requires moving twice the data, and data movement is often more important than CPU time.

    The large field of numerical analysis is devoted to finding solutions, with more robust and stable algorithms.

    Summing an array by first sorting, and then summing the absolutely smaller elements first is one technique. Inverting a matrix by pivoting on the absolutely largest element, instead of on \(a_{11}\) is another.

Compiling and running

  1. C++ compilers: g++ 7.2 is installed. I can add g++-6 if you wish.
  2. Measuring program times when many people are on the machine:
    1. This is a problem.
    2. One partial solution is to use batch to queue your job to run when the system is unloaded.

More OpenMP

  1. Another nice, albeit obsolescent and incomplete, ref: Wikibooks OpenMP.

  2. Its tasks examples are interesting.

  3. OpenMP tasks

    1. While inside a pragma parallel, you queue up lots of tasks - they're executed as threads are available.

    2. My example is tasks.cc with some variants.

    3. taskfib.cc is modified from an example in the OpenMP spec.

      1. It shows how tasks can recursively spawn more tasks.

      2. This is only an example; you would never implement Fibonacci that way. (The fastest way is the following closed form. In spite of the square roots, the expression always gives an integer.)

        \(F_n = \frac{\left( \frac{1+\sqrt{5}}{2} \right) ^n - \left( \frac{1-\sqrt{5}}{2} \right) ^n }{\sqrt{5}}\)

      3. Spawning a task is expensive; we can calculate the cost.

  4. stacksize.cc - get and set stack size. Can also do ulimit -s.

  5. omps.cc - read assorted OMP variables.

  6. Since I/O is not thread-safe, in hello_single_barrier.cc it has to be protected by a critical and also followed by a barrier.

  7. OpenMP sections - my example is sections.cc with some variants.

    1. Note my cool way to print an expression's name followed by its value.
    2. Note the 3 required levels of pragmas: parallel, sections, section.
    3. The assignment of sections to threads is nondeterministic.
    4. IMNSHO, OpenMP is considerably easier than pthreads, fork, etc.
  8. http://stackoverflow.com/questions/13788638/difference-between-section-and-task-openmp

  9. What you should have learned from the OpenMP lectures:

    1. How to use a well-designed and widely used API that is useful on almost all current CPUs.
    2. Shared memory: the most common parallel architecture.
    3. How to structure a program to be parallelizable.
    4. Issues in parallel programs, like nondeterminism, race conditions, critical regions, etc.

ssh, afs, zfs

  1. I recommend that you create key pairs to connect to geoxeon and parallel w/o typing your password each time.

  2. To avoid retyping your ssh private key passphrase in the same shell, do this

    ssh-add

  3. One advantage of ssh is that you can mount your geoxeon directory on your local machine. On my laptop, I run nemo, then do File - Connect to server, choose type ssh, server geoxeon.ecse.rpi.edu, use my RCSID as the user name, leave the password blank, and give my private key passphrase if asked. Any other program to mount network shares would work as well.

    Where the share is actually mounted varies. On my laptop, it's here: /var/run/user/1000/gvfs/sftp:host=geoxeon.ecse.rpi.edu,user=wrf

    As with any network mount, fancy filesystem things like attributes, simultaneous reading and writing from different machines, etc., probably won't work.

  4. With ssh, you can also copy files back and forth w/o mounting or typing passwords:

    1. scp localfile geoxeon.ecse.rpi.edu:
    2. scp -r localdir geoxeon.ecse.rpi.edu:
    3. scp geoxeon.ecse.rpi.edu:remotefile .

    It even does filename completion on remote files.

  5. You can also run single commands:

    ssh geoxeon.ecse.rpi.edu hostname

  6. Geoxeon sometimes implements AFS, so your RCS files are accessible (read and write) at

    /afs/rpi.edu/home/??/RCSID.

    Search for your home dir thus:

    ls -ld /afs/rpi.edu/home/*/RCSID*.

    Authenticate yourself with klog.

  7. On geoxeon, your files are transparently compressed, so there's no need to explicitly use gzip etc.

Types of memory allocation

Here's a brief overview of my understanding of the various places that you can assign memory in a program.

  1. Static. Define a fixed-size global array. The variable is allocated at compile time, so accesses might perhaps be faster. Global vars with non-default initial values increase the executable file size. If they're large enough, you need to use the compiler option -mcmodel=medium or -mcmodel=large. I don't know the effect on the program's size or speed.
  2. Stack. Define local arrays, which are created and freed as the routine is entered and exited. Their addresses relative to the base of the call frame may be constant. The default stack size is 8MB. You can increase this with the command ulimit or in the program as shown in stacksize.cc. I believe that in OpenMP, the max stack size may be allocated when each thread is created. Then a really big stack size might have a penalty.
  3. Heap. You use new and delete. Variables are constructed whenever you want. The more objects on the heap, the more time each new or delete takes. If you have lots of objects, consider using placement new or creating an array of them.

I like to use static, then stack, and heap only when necessary.

Google's allocator is noticeably better than the default one. To use it, link your programs with -ltcmalloc. You can often use it on an existing executable foo thus:

LD_PRELOAD="/usr/lib/libtcmalloc.so" foo

I found it to save 15% to 30% in time.

Another memory concern is speed. Geoxeon has a NUMA (Non-Uniform Memory Architecture). It has two 8-core Xeons, and each processor has 64GB of main memory attached. Although all 128GB are in a common address space, accessing memory attached to the processor that the thread is running on is faster.

The following is what I think based on some research, but may be wrong: A 4KB page of memory is assigned to a specific core when it is first written (not when it is reserved). So, each page of a large array may be on a different core. This can be used to optimize things. This gets more fun with 8-processor systems.

All that is separate from cache issues.

You can also assign your OpenMP threads to specific cores. This affects speed in ways I don't understand. The issues are resource sharing vs conflicts.

3D-EPUG-Overlay - Big program using OpenMP

Salles Magalhães's PhD.

https://wrf.ecse.rpi.edu/nikola/pages/software/#id5

NVidia device architecture, start of CUDA

  1. The above shared memory model hits a wall; CUDA handles the other side of the wall.

CUDA

See Stanford's parallel course notes, which they've made freely available. (Thanks.)

  1. They are very well done.
  2. They are obsolete in some places, which I'll mention.
  3. However, a lot of newer CUDA material is also obsolete.
  4. Nevertheless, the principles are still mostly valid, which is why I mention them. (Also, it's hard to find new parallel courses online, and I've looked.)