-
Versions: OpenMP is a living project that is updated every few years.
There is a lot of free online documentation, examples, and tutorials.
New features are regularly added, such as support for GPU backends.
However, the documentation and implementations lag behind.
I'll use obsolete documentation when it's good enough. It's all still correct and fine for most applications.
Then I'll present some recent additions.
We'll see some examples running on parallel.ecse, at /parclass2022/files/openmp/rpi
-
We saw that running even a short OpenMP program could be
unpredictable. The reason for that was a big lesson.
There are no guarantees about the scheduling of the various
threads. There are no guarantees about fairness of resource
allocation between the threads. Perhaps thread 0 finishes
before thread 1 starts. Perhaps thread 1 finishes before
thread 2 starts. Perhaps thread 0 executes 3 machine
instructions, then thread 1 executes 1 instruction, then thread
0 executes 100 more, etc. Each time the process runs,
something different may happen. One thing may happen almost
all the time, to trick you.
This happened during the first space shuttle launch in 1981. A
1-in-70 chance event prevented the computers from synchronizing.
This had been observed once before during testing. However
when they couldn't get it to happen again, they ignored it.
This interlacing of different threads can happen at the machine
instruction level. The C statement K=K+3 can translate
into the machine instructions
LOAD K; ADD 3; STORE K.
Call the two threads red (R) and blue (B).
If the threads interlace thus:
R:LOAD K; R:ADD 3; R:STORE K; B:LOAD K; B:ADD 3; B:STORE K
then K increases by 6. If they interlace thus:
R:LOAD K; R:ADD 3; B:LOAD K; B:ADD 3; R:STORE K; B:STORE K
then K increases by only 3, and one thread's addition is lost.
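As a sketch of how such a race shows up in OpenMP (this is not one of the course files; the file name race.cc is my own), a loop with an unprotected shared update gives a different wrong answer on different runs:

    #include <cstdio>

    int main() {
      int k = 0;
      // Unprotected read-modify-write: each iteration does k = k + 3,
      // i.e. LOAD k; ADD 3; STORE k, and the threads' LOADs and STOREs interlace.
      #pragma omp parallel for
      for (int i = 0; i < 1000000; i++)
        k = k + 3;
      // 3000000 if nothing were lost; the actual value varies from run to run.
      printf("k = %d (expected 3000000)\n", k);
      return 0;
    }

Compile with g++ -fopenmp race.cc and run it a few times.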
-
This can be really hard to debug, particularly when you don't
know where this is happening.
One solution is to serialize the offending code, so that only
one thread executes it at a time. In the limit, you would
serialize the whole program and get no benefit from parallel processing.
OpenMP has several mechanisms to help.
-
A critical pragma serializes the following block. There
are two considerations.
You lose the benefits of parallelization on that block.
There is an overhead to set up a critical block that I
estimate might be 100,000 instructions.
An atomic pragma serializes one short C statement, which
must be one of a small set of allowed operations, such as increment. It
matches certain atomic machine instructions, like test-and-set.
The overhead is much smaller.
If every thread is summing into a common total variable, the
reduction pragma causes each thread to sum into a private
subtotal, and then sum the subtotals. This is very fast.
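As a sketch (not one of the course files; variable names are mine), here are the three mechanisms applied to summing \(i^2\):

    #include <cstdio>

    int main() {
      const long n = 100000;
      long total1 = 0, total2 = 0, total3 = 0;

      // critical: serializes the whole block; correct, but the highest overhead.
      #pragma omp parallel for
      for (long i = 1; i <= n; i++) {
        #pragma omp critical
        total1 += i * i;
      }

      // atomic: serializes one simple update; much cheaper.
      #pragma omp parallel for
      for (long i = 1; i <= n; i++) {
        #pragma omp atomic
        total2 += i * i;
      }

      // reduction: each thread sums into a private subtotal, and the
      // subtotals are combined at the end; usually the fastest.
      #pragma omp parallel for reduction(+:total3)
      for (long i = 1; i <= n; i++)
        total3 += i * i;

      printf("%ld %ld %ld\n", total1, total2, total3);
      return 0;
    }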
-
Another lesson is that sometimes you can check your program's
correctness with an independent computation. For the trivial
example of summing \(i^2\), use the formula
\(\sum_{i=1}^N i^2 = \frac{N(N+1)(2N+1)}{6}\).
There is a lot of useful algebra if you have the time to learn
it. I, ahem, learned this formula in high school.
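For example, checking the formula at \(N=4\): \(1+4+9+16=30\) and \(\frac{4\cdot 5\cdot 9}{6}=30\).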
Another lesson is that even when the program gives the same
answer every time, it might still be consistently wrong.
Another is that merely compiling your program with OpenMP enabled
(e.g., with -fopenmp) slows it down even if you
don't use any OpenMP features.
Another is that the only meaningful time metric is elapsed real
time. One reason that CPU time is meaningless is that OpenMP
sometimes pauses a thread's progress by making it spin in a
CPU-bound loop. (Busy-waiting is also a common technique in hardware.)
Also note that CPU times can vary considerably with successive executions.
Also, using too many threads might increase the real time.
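As a sketch of how to time things (my own snippet, not a course file): omp_get_wtime() gives elapsed real time, while clock() gives CPU time summed over the threads.

    #include <cstdio>
    #include <ctime>
    #include <omp.h>

    int main() {
      double t0 = omp_get_wtime();   // elapsed (wall-clock) seconds
      clock_t c0 = clock();          // CPU time, summed over all threads

      double sum = 0;
      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < 100000000; i++)
        sum += i * 0.5;

      // With several threads the CPU time can be several times the elapsed
      // time, and busy-waiting inflates it further; report the elapsed time.
      printf("sum=%g elapsed=%g s cpu=%g s\n", sum,
             omp_get_wtime() - t0, (double)(clock() - c0) / CLOCKS_PER_SEC);
      return 0;
    }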
-
Finally, floating point numbers have their own problems. They
are an approximation of the mathematical concept called the
real number field. That is defined by 11 axioms that
state obvious properties, like
A+B=B+A (commutativity) and
A+(B+C)=(A+B)+C (associativity).
This is covered in courses on modern algebra or abstract
algebra.
The problem is that most of these are violated, at least
somewhat, with floats. E.g.,
\(\left(10^{20}-10^{20}\right)+1=0+1=1\) but
\(10^{20}+\left(-10^{20}+1\right)=10^{20}-10^{20}=0\).
Therefore when threads execute in a different order, floating
point results might differ.
There is no perfect solution, though using double is a
start.
On modern CPUs, double is just as fast as float.
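A two-line demonstration of that example (a sketch, not a course file):

    #include <cstdio>

    int main() {
      double a = 1e20, b = -1e20, c = 1.0;
      printf("(a + b) + c = %g\n", (a + b) + c);   // prints 1
      printf("a + (b + c) = %g\n", a + (b + c));   // prints 0: b + c rounds back to -1e20
      return 0;
    }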
-
However, double is slower on GPUs.
How much slower varies. Nvidia's Maxwell line spent very little
real estate on double precision, so it was very slow. In contrast,
Maxwell added a half-precision float, for implementing neural nets.
Apparently neural nets need very few significant digits.
Nvidia's newer Pascal line reverted that design choice and spends
more real estate on implementing double. parallel.ecse's GPU is Pascal.
Nvidia's Volta generation also does doubles well.
Double also requires moving twice the data, and data movement is
often more important than CPU time.
The large field of numerical analysis is devoted
to finding solutions, with more robust and stable algorithms.
Summing an array by first sorting, and then summing the
absolutely smaller elements first is one technique. Inverting
a matrix by pivoting on the absolutely largest element, instead
of on \(a_{11}\) is another.
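As a sketch of the first technique (my own toy example, not a course file): one big value plus many tiny ones, summed naively left to right versus smallest magnitude first.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Sum after sorting by absolute value, smallest first, so the tiny
    // terms aren't swamped by an already-large partial sum.
    float sortedSum(std::vector<float> v) {
      std::sort(v.begin(), v.end(),
                [](float a, float b) { return std::fabs(a) < std::fabs(b); });
      float s = 0;
      for (float x : v) s += x;
      return s;
    }

    int main() {
      std::vector<float> v(100001, 1e-3f);   // 100000 tiny values...
      v[0] = 1e7f;                           // ...and one large one
      float naive = 0;
      for (float x : v) naive += x;          // left to right: every tiny value is lost
      printf("naive %.1f  sorted %.1f  exact 10000100\n", naive, sortedSum(v));
      return 0;
    }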
Now that we've seen some examples, look at the LLNL tutorial.
https://hpc-tutorials.llnl.gov/openmp/
Another nice, albeit obsolescent and incomplete, ref: Wikibooks OpenMP.
Its tasks examples are interesting.
-
OpenMP tasks
While inside a parallel pragma, you queue up lots of tasks; they're executed
as threads become available.
My example is tasks.cc with some variants.
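A minimal sketch of the idea (not the actual tasks.cc): one thread creates the tasks, and whichever threads are free run them.

    #include <cstdio>
    #include <omp.h>

    int main() {
      #pragma omp parallel        // start the team of threads
      #pragma omp single          // one thread queues up the tasks
      for (int i = 0; i < 10; i++) {
        #pragma omp task firstprivate(i)   // any available thread may run this
        printf("task %d ran on thread %d\n", i, omp_get_thread_num());
      }    // the implicit barrier at the end of single waits for all the tasks
      return 0;
    }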
-
taskfib.cc is modified from an example in the OpenMP spec.
It shows how tasks can recursively spawn more tasks.
-
This is only an example; you would never implement Fibonacci that
way. (The fastest way is the following closed form. In spite of the
sqrts, the expression always gives an integer.)
\(F_n = \frac{\left( \frac{1+\sqrt{5}}{2} \right) ^n - \left( \frac{1-\sqrt{5}}{2} \right) ^n }{\sqrt{5}}\)
Spawning a task is expensive; we can calculate the cost.
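A sketch in the same spirit as taskfib.cc (not the actual file): each call spawns two child tasks and waits for them, and small subproblems are done inline because spawning a task costs far more than a few additions; the cutoff of 20 is my own choice.

    #include <cstdio>
    #include <omp.h>

    long fib(int n) {
      if (n < 2) return n;
      if (n < 20) return fib(n - 1) + fib(n - 2);   // too small to be worth a task
      long a, b;
      #pragma omp task shared(a)
      a = fib(n - 1);
      #pragma omp task shared(b)
      b = fib(n - 2);
      #pragma omp taskwait        // wait for the two child tasks
      return a + b;
    }

    int main() {
      long f;
      #pragma omp parallel
      #pragma omp single          // one thread starts the recursion
      f = fib(40);
      printf("fib(40) = %ld\n", f);
      return 0;
    }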
omps.cc - read assorted OMP variables.
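A sketch of the kind of thing such a program prints (the actual omps.cc may query more):

    #include <cstdio>
    #include <omp.h>

    int main() {
      printf("omp_get_num_procs()   = %d\n", omp_get_num_procs());
      printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
      printf("omp_get_dynamic()     = %d\n", omp_get_dynamic());
      printf("omp_get_wtick()       = %g\n", omp_get_wtick());
      #pragma omp parallel
      #pragma omp single
      printf("omp_get_num_threads() = %d inside a parallel region\n",
             omp_get_num_threads());
      return 0;
    }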
Since I/O is not thread-safe, in hello_single_barrier.cc
it has to be protected by a critical and also followed by a barrier.
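A sketch in the spirit of hello_single_barrier.cc (not the actual file):

    #include <iostream>
    #include <omp.h>

    int main() {
      #pragma omp parallel
      {
        // critical keeps the threads' messages from interleaving.
        #pragma omp critical
        std::cout << "hello from thread " << omp_get_thread_num() << "\n";
        // barrier: no thread continues until every hello has been printed.
        #pragma omp barrier
        #pragma omp single
        std::cout << "all " << omp_get_num_threads() << " threads said hello\n";
      }
      return 0;
    }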
-
OpenMP sections - my example is sections.cc with some variants.
Note my cool way to print an expression's name followed by its value.
Note the 3 required levels of pragmas: parallel, sections, section.
The assignment of sections to threads is nondeterministic.
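A sketch of the structure (not the actual sections.cc); the SHOW macro here is only my guess at the print-name-and-value trick:

    #include <cstdio>
    #include <omp.h>

    // Print an expression's text, then its value (hypothetical macro).
    #define SHOW(e) printf(#e " = %d\n", (int)(e))

    int main() {
      #pragma omp parallel          // level 1: create the team
      {
        #pragma omp sections        // level 2: the group of sections
        {
          #pragma omp section       // level 3: one independent piece of work
          SHOW(omp_get_thread_num());
          #pragma omp section       // another, maybe run by a different thread
          SHOW(omp_get_thread_num() + 100);
        }
      }
      return 0;
    }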
IMNSHO, OpenMP is considerably easier than pthreads, fork, etc.
http://stackoverflow.com/questions/13788638/difference-between-section-and-task-openmp
OpenMP is weak on compiling for GPUs. So....