PAR Lecture 4, Mon Jan 30

W Randolph Franklin, RPI

2017-01-30 00:00


Today is the last day to switch to the 6000 version of this course.
Accounts have been set up on parallel.ecse.rpi.edu.
parallel is accessible from on campus; from off campus use the VPN.
I copied over all your passwords and home directories from geoxeon.
I copied over geoxeon:/home/parcomp/Class, including the yellowcab file.
On both machines, I created a link /parallel-class to point to /home/parcomp/Class.
1. That will get you specific material for this class.
2. However, you're welcome to roam around on geoxeon to look at stuff from last time, other yellowcab files, etc.
3. My initial security policy for this class is to install reasonable security but then trust you not to abuse it.
4. Should you find something that is readable but probably should be protected, please tell me.
However, they are two separate computers that are not synchronized.
1. Changing your password on one machine won't change it on the other.
2. The HW is different.
3. The OS is different: Ubuntu on geoxeon, and CentOS on parallel.
4. Versions of SW may be different.
5. CentOS is a more production-oriented OS, and so by default installs older versions.
6. I'm working to put current SW on parallel, but it takes time since I'm the person maintaining it.
I recommend using parallel whenever you can.
I may put new stuff only on parallel.
However, currently only geoxeon has CUDA fully installed.
We saw that running even a short OpenMP program could be unpredictable. The reason for that was a big lesson.

There are no guarantees about the scheduling of the various threads. There are no guarantees about fairness of resource allocation between the threads. Perhaps thread 0 finishes before thread 1 starts. Perhaps thread 1 finishes before thread 2 starts. Perhaps thread 0 executes 3 machine instructions, then thread 1 executes 1 instruction, then thread 0 executes 100 more, etc. Each time the process runs, something different may happen. One thing may happen almost all the time, to trick you.

This kind of timing problem happened during the first space shuttle launch in 1981. A 1-in-70 chance event prevented the computers from synchronizing. This had been observed once before during testing. However, when they couldn't get it to happen again, they ignored it.

This interlacing of different threads can happen at the machine instruction level. The C statement K=K+3 can translate into the machine instructions

LOAD K; ADD 3; STORE K.

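Let's label the two threads red (R) and blue (B). Each thread has its own register.

If the threads interlace thus:

R: LOAD K; R: ADD 3; R: STORE K; B: LOAD K; B: ADD 3; B: STORE K

then K increases by 6. If they interlace thus:

R: LOAD K; R: ADD 3; B: LOAD K; B: ADD 3; R: STORE K; B: STORE K

then K increases by only 3, because both threads loaded the same old value of K and the second STORE overwrites the first.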
This can be really hard to debug, particularly when you don't know where this is happening.
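To make the two interleavings concrete, here is a minimal sketch in C that simulates them deterministically, with no real threads involved. The names Thread, load, add3, store, run_sequential and run_interleaved are made up for this illustration; each simulated thread just has a private accumulator standing in for a register.

```c
/* One simulated thread: a private accumulator (its "register"). */
typedef struct { int acc; } Thread;

/* The three machine steps that K = K + 3 expands to. */
static void load(Thread *t, const int *k)  { t->acc = *k; }
static void add3(Thread *t)                { t->acc += 3; }
static void store(const Thread *t, int *k) { *k = t->acc; }

/* Red runs all three steps, then blue runs all three: no overlap. */
int run_sequential(int k) {
    Thread red = {0}, blue = {0};
    load(&red, &k);  add3(&red);  store(&red, &k);
    load(&blue, &k); add3(&blue); store(&blue, &k);
    return k;
}

/* Both threads load K before either one stores: one update is lost. */
int run_interleaved(int k) {
    Thread red = {0}, blue = {0};
    load(&red, &k);  add3(&red);
    load(&blue, &k); add3(&blue);
    store(&red, &k);
    store(&blue, &k);
    return k;
}
```

Starting from K=10, run_sequential returns 16 and run_interleaved returns 13. In a real OpenMP program you don't get to pick which schedule happens.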

One solution is to serialize the offending code, so that only one thread executes it at a time. In the limit, you would serialize the whole program and not use parallel processing at all.
OpenMP has several mechanisms to help.
A critical pragma serializes the following block. There are two considerations.
1. You lose the benefits of parallelization on that block.
2. There is an overhead to set up a critical block that I estimate might be 100,000 instructions.
An atomic pragma serializes one short C statement, which must be one of a small set of allowed operations, such as increment. It matches certain atomic machine instructions, like test-and-set. The overhead is much smaller.
If every thread is summing into a common total variable, the reduction clause causes each thread to sum into a private subtotal, and then sums the subtotals. This is very fast.
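The three mechanisms above might look roughly like this when summing \(i^2\). This is a sketch, not the code from class; the function names sum_critical, sum_atomic and sum_reduction are made up. Compile with -fopenmp to run the loops in parallel; without it, the pragmas are ignored and the loops run serially, still giving the right answer.

```c
/* Sum of i*i for i = 1..n, protected three different ways. */

/* critical: only one thread at a time may execute the block. */
long long sum_critical(int n) {
    long long total = 0;
    #pragma omp parallel for
    for (int i = 1; i <= n; i++) {
        #pragma omp critical
        { total += (long long)i * i; }
    }
    return total;
}

/* atomic: protects just the one update statement. */
long long sum_atomic(int n) {
    long long total = 0;
    #pragma omp parallel for
    for (int i = 1; i <= n; i++) {
        #pragma omp atomic
        total += (long long)i * i;
    }
    return total;
}

/* reduction: each thread sums into a private copy of total,
   and the private copies are combined at the end of the loop. */
long long sum_reduction(int n) {
    long long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 1; i <= n; i++) {
        total += (long long)i * i;
    }
    return total;
}
```

All three return the same total. Removing the critical or atomic line from the first two versions brings back the race on total.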
Another lesson is that sometimes you can check your program's correctness with an independent computation. For the trivial example of summing \(i^2\), use the formula

\(\sum_{i=1}^N i^2 = N(N+1)(2N+1)/6\).

There is a lot of useful algebra if you have the time to learn it. I, ahem, learned this formula in high school.
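Such an independent check could be sketched as follows. The function names are made up for this example, and it assumes N is small enough (roughly up to a million) that N(N+1)(2N+1) fits in a long long.

```c
/* Brute-force sum of i*i for i = 1..n. */
long long sum_squares_loop(long long n) {
    long long total = 0;
    for (long long i = 1; i <= n; i++) {
        total += i * i;
    }
    return total;
}

/* Closed form N(N+1)(2N+1)/6. The product is always divisible by 6,
   so the integer division is exact. */
long long sum_squares_formula(long long n) {
    return n * (n + 1) * (2 * n + 1) / 6;
}

/* Returns 1 if the two computations agree, 0 otherwise. */
int sums_agree(long long n) {
    return sum_squares_loop(n) == sum_squares_formula(n);
}
```

For N=10 both give 385. If a parallel version of the loop disagrees with the formula, the parallel version is the one to suspect.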
Another lesson is that even when the program gives the same answer every time, it might still be consistently wrong.
Another is that just compiling with OpenMP enabled, e.g., with -fopenmp, slows your program down even if you don't use any OpenMP facilities.
Another is that the only meaningful time metric is elapsed real time. One reason that CPU time is meaningless is that OpenMP sometimes pauses a thread's progress by making it spin in a CPU-bound loop (busy waiting). (That is also a common technique in HW.)
Also note that CPU times can vary considerably with successive executions.
Also, using too many threads might increase the real time.
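Here is one way to measure both times around a piece of work, as a sketch using only standard C11. The names Timing, spin, wall_now and time_spin are made up. It assumes timespec_get is available; that reads the calendar clock, which is good enough for a rough elapsed time but can jump if the system clock is adjusted. OpenMP programs can call omp_get_wtime() instead. On Linux, clock() reports CPU time added up over all the process's threads, so with several busy threads it can exceed the elapsed time.

```c
#include <time.h>

/* Elapsed (wall-clock) seconds and CPU seconds for one measurement. */
typedef struct { double wall_s; double cpu_s; } Timing;

/* Busy work to time; the volatile sink keeps the loop from being optimized away. */
static void spin(long iters) {
    volatile double sink = 0.0;
    for (long i = 0; i < iters; i++) {
        sink += (double)i;
    }
}

/* Current wall-clock time in seconds. */
static double wall_now(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Time spin(iters) both ways. */
Timing time_spin(long iters) {
    double w0 = wall_now();
    clock_t c0 = clock();
    spin(iters);
    clock_t c1 = clock();
    double w1 = wall_now();
    Timing t = { w1 - w0, (double)(c1 - c0) / CLOCKS_PER_SEC };
    return t;
}
```

Report wall_s when comparing runs, and expect it to vary from run to run.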
Finally, floating point numbers have their own problems. They are an approximation of the mathematical concept called the real number field. That is defined by 11 axioms that state obvious properties, like

A+B=B+A (commutativity) and

A+(B+C)=(A+B)+C (associativity).

This is covered in courses on modern algebra or abstract algebra.

The problem is that most of these are violated, at least somewhat, with floats. E.g.,

\(\left(10^{20}-10^{20}\right)+1=0+1=1\) but

\(10^{20}+\left(-10^{20}+1\right)=10^{20}-10^{20}=0\).

Therefore, when threads execute in a different order, floating-point results might be different.
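The \(10^{20}\) example can be reproduced in a few lines of C. This sketch assumes ordinary IEEE 754 doubles and default compiler settings (no -ffast-math); the function names are made up.

```c
/* The same three numbers added with two different groupings. */

double left_grouped(double a, double b, double c) {
    return (a + b) + c;
}

double right_grouped(double a, double b, double c) {
    return a + (b + c);
}
```

With a = 1e20, b = -1e20, c = 1.0, left_grouped returns 1 and right_grouped returns 0, because -1e20 + 1 rounds back to -1e20.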

There is no perfect solution, though using double is a start.
1. On modern CPUs, double is just as fast as float.
2. However it's slower on GPUs.
  
  How much slower can vary. Nvidia's Maxwell line spent very little real estate on double precision, so it was very slow. In contrast, Maxwell added a half-precision float, for implementing neural nets. Apparently neural nets use very few significant digits.
  
  Nvidia's newer Pascal line reverted this design choice and spends more real estate on implementing double. parallel.ecse's GPU is Pascal.
3. It also requires moving twice the data, and data movement is often more important than CPU time.
The large field of numerical analysis is devoted to finding solutions, with more robust and stable algorithms.

Summing an array by first sorting it and then adding the elements from smallest to largest absolute value is one technique. Inverting a matrix by pivoting on the element with the largest absolute value, instead of on \(a_{11}\), is another.
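The sorted-sum idea might be sketched like this. The names cmp_abs, sum_naive and sum_sorted are made up, and the sketch assumes IEEE 754 doubles evaluated in double precision, as on x86-64. It sorts a copy, so the caller's array is left alone.

```c
#include <stdlib.h>
#include <string.h>

/* Order doubles by increasing absolute value, for qsort. */
static int cmp_abs(const void *p, const void *q) {
    double a = *(const double *)p;
    double b = *(const double *)q;
    if (a < 0) a = -a;
    if (b < 0) b = -b;
    return (a > b) - (a < b);
}

/* Add the elements left to right, in the order given. */
double sum_naive(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        s += x[i];
    }
    return s;
}

/* Add the elements from smallest to largest absolute value. */
double sum_sorted(const double *x, size_t n) {
    double *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL) {
        return sum_naive(x, n);  /* no memory (or n == 0): fall back */
    }
    memcpy(tmp, x, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_abs);
    double s = sum_naive(tmp, n);
    free(tmp);
    return s;
}
```

For the array {2^53, 1, 1, 1, 1}, sum_naive returns 2^53 because each added 1 is rounded away, while sum_sorted adds the four 1s first and returns 2^53 + 4.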
Next class: start CUDA.