PAR Class 5, Mon 2021-02-08

1 ssh, afs, zfs

  1. I recommend that you create an ssh key pair so that you can connect to parallel w/o typing your password each time.

  2. To avoid retyping your ssh private key passphrase in the same shell, run

    ssh-add

  3. One advantage of ssh is that you can mount your parallel directory on your local machine. In the files browser on my laptop, I connect to the server sftp://parallel.ecse.rpi.edu. Any other program to mount network shares would work as well.

    Where the share is actually mounted varies. On my laptop, it's here: /var/run/user/1000/gvfs/sftp:host=parallel.ecse.rpi.edu,user=wrf

    1000 is my uid on my laptop. Yours is not 1000.

    As with any network mount, fancy filesystem things like attributes, simultaneous reading and writing from different machines, etc., probably won't work.

  4. With ssh, you can also copy files back and forth w/o mounting or typing passwords:

    1. scp localfile parallel.ecse.rpi.edu:

    2. scp -r localdir parallel.ecse.rpi.edu:

    3. scp parallel.ecse.rpi.edu:remotefile .

    It even does filename completion on remote files.

  5. You can also run single commands:

    ssh parallel.ecse.rpi.edu hostname

  6. On parallel, most filesystems use zfs.

    1. Files are transparently compressed, so there's no need to explicitly use gzip etc.

    2. Filesystems are automatically snapshotted, so you can often recover accidentally deleted or changed files. Look under FSROOT /.zfs/snapshot/

2 Stack size

  1. Each thread has its own stack.

  2. Default size is small, but easy and cheap to increase.

  3. Since pages (for stacks and everything) are not allocated until touched, reserving large blocks of VM, like for stacks, is free.

  4. stacksize.cc - get and set stack size. Can also do ulimit -s.

3 OpenMP

    1. Versions: OpenMP is a living project that is updated every few years.

    1. There is a lot of free online documentation, examples, and tutorials.

    2. New features are regularly added, such as support for GPU backends.

    3. However, the documentation and implementations lag behind.

    4. In my course, I'll use obsolete documentation when it's good enough. It's all still correct and fine for most applications.

    5. Then I'll present some recent additions.

  2. We'll review some examples running on parallel, at /parclass2022/openmp/rpi/.

  3. Now that we've seen some examples, look at the LLNL tutorial. https://computing.llnl.gov/tutorials/openMP/

  4. We saw that running even a short OpenMP program could be unpredictable. The reason for that was a big lesson.

    There are no guarantees about the scheduling of the various threads. There are no guarantees about fairness of resource allocation between the threads. Perhaps thread 0 finishes before thread 1 starts. Perhaps thread 1 finishes before thread 2 starts. Perhaps thread 0 executes 3 machine instructions, then thread 1 executes 1 instruction, then thread 0 executes 100 more, etc. Each time the process runs, something different may happen. One thing may happen almost all the time, to trick you.

    This happened during the first space shuttle launch in 1981. A 1-in-70 chance event prevented the computers from synchronizing. This had been observed once before during testing. However, when they couldn't get it to happen again, they ignored it.

    This interlacing of different threads can happen at the machine instruction level. The C statement K=K+3 can translate into the machine instructions

    LOAD K; ADD 3; STORE K.

    Label the two threads T0 and T1.

    If the threads interlace thus:

    T0: LOAD K; T0: ADD 3; T0: STORE K; T1: LOAD K; T1: ADD 3; T1: STORE K

    then K increases by 6. If they interlace thus:

    T0: LOAD K; T0: ADD 3; T1: LOAD K; T1: ADD 3; T0: STORE K; T1: STORE K

    then K increases by only 3, because T1 loaded the old value and its store overwrote T0's update.

  5. This can be really hard to debug, particularly when you don't know where this is happening.

    One solution is to serialize the offending code, so that only one thread executes it at a time. In the limit, you would serialize the whole program and lose all benefit of parallel processing.

  6. OpenMP has several mechanisms to help.

  7. A critical pragma serializes the following block. There are two considerations.

    1. You lose the benefits of parallelization on that block.

    2. There is an overhead to set up a critical block that I estimate might be 100,000 instructions.

  8. An atomic pragma serializes one short C statement, which must be one of a small set of allowed operations, such as increment. It matches certain atomic machine instructions, like test-and-set. The overhead is much smaller.

  9. If every thread is summing into a common total variable, the reduction pragma causes each thread to sum into a private subtotal, and then sum the subtotals. This is very fast.

  10. Another lesson is that sometimes you can check your program's correctness with an independent computation. For the trivial example of summing \(i^2\), use the formula

    \(\sum_{i=1}^N i^2 = \frac{N(N+1)(2N+1)}{6}\).

    There is a lot of useful algebra if you have the time to learn it. I, ahem, learned this formula in high school.

  11. Another lesson is that even when the program gives the same answer every time, it might still be consistently wrong.

  12. Another is that just compiling with OpenMP enabled, e.g., with -fopenmp, slows your program down even if you don't use any OpenMP features.

  13. Another is that the only meaningful time metric is elapsed real time. One reason that CPU time is meaningless is that OpenMP sometimes pauses a thread's progress by making it execute a CPU-bound busy-wait loop. (Busy-waiting is also a common technique in hardware.)

  14. Also note that CPU times can vary considerably with successive executions.

  15. Also, using too many threads might increase the real time.

  16. Finally, floating point numbers have their own problems. They are an approximation of the mathematical concept called the real number field. That is defined by 11 axioms that state obvious properties, like

    A+B=B+A (commutativity) and

    A+(B+C)=(A+B)+C (associativity).

    This is covered in courses on modern algebra or abstract algebra.

    The problem is that most of these are violated, at least somewhat, with floats. E.g.,

    \(\left(10^{20}-10^{20}\right)+1=0+1=1\) but

    \(10^{20}+\left(-10^{20}+1\right)=10^{20}-10^{20}=0\).

    Therefore when threads execute in a different order, floating-point results might differ.

    There is no perfect solution, though using double is a start.

    1. On modern CPUs, double is just as fast as float.

    2. However it's slower on GPUs.

      How much slower can vary. Nvidia's Maxwell line spent very little real estate on double precision, so it was very slow. In contrast, Maxwell added a half-precision float, for implementing neural nets. Apparently neural nets use very few significant digits.

      Nvidia's newer Pascal line reverted this design choice and spends more real estate on implementing double. parallel.ecse's GPU is Pascal.

      Nvidia's Volta generation also does doubles well.

    3. It also requires moving twice the data, and data movement is often more important than CPU time.

    The large field of numerical analysis is devoted to finding solutions, with more robust and stable algorithms.

    Summing an array by first sorting, and then summing the absolutely smaller elements first is one technique. Inverting a matrix by pivoting on the absolutely largest element, instead of on \(a_{11}\) is another.

  17. Another nice, albeit obsolescent and incomplete, ref: Wikibooks OpenMP.

  18. Its tasks examples are interesting.

  19. OpenMP tasks

    1. While inside a pragma parallel, you queue up lots of tasks - they're executed as threads are available.

    2. My example is tasks.cc with some variants.

    3. taskfib.cc is modified from an example in the OpenMP spec.

      1. It shows how tasks can recursively spawn more tasks.

      2. This is only an example; you would never implement Fibonacci that way. (The fastest way is the following closed form. In spite of the square roots, the expression always evaluates to an integer.)

        \(F_n = \frac{\left( \frac{1+\sqrt{5}}{2} \right) ^n - \left( \frac{1-\sqrt{5}}{2} \right) ^n }{\sqrt{5}}\)

      3. Spawning a task is expensive; we can calculate the cost.

  20. omps.cc - read assorted OMP variables.

  21. Since I/O is not thread-safe, the output in hello_single_barrier.cc has to be protected by a critical pragma and also followed by a barrier.

  22. OpenMP sections - my example is sections.cc with some variants.

    1. Note my cool way to print an expression's name followed by its value.

    2. Note the 3 required levels of pragmas: parallel, sections, section.

    3. The assignment of sections to threads is nondeterministic.

    4. IMNSHO, OpenMP is considerably easier than pthreads, fork, etc.

  23. http://stackoverflow.com/questions/13788638/difference-between-section-and-task-openmp

  24. OpenMP is weak on compiling for GPUs. So....

4 OpenACC

  1. Newer than OpenMP.

  2. More abstract; compiler tries to determine what to parallelize.

  3. Your algorithm needs to be parallelizable. That's the hardest part.

  4. https://www.openacc.org/

  5. I'll walk through the tutorials from https://www.openacc.org/events/openacc-online-course-2018 .

    They also introduce you to Nvidia.

    Today: week 1.

5 Lots of compilers

  1. g++

  2. nvcc - specifically for cuda.

  3. pgc++ - commercial. Perhaps better than g++, especially for Nvidia.

  4. nvc++ - Nvidia's takeover of pgc++. Perhaps the best. I'll install it RSN.

6 Homework 3

is online here, due in a week.

7 Next topic

Nvidia