PAR Class 5, Wed 2018-02-14

1   Optional Homework - bring your answers and discuss next week

  1. Write a program to multiply two 100x100 matrices. Do it the conventional way, not using anything fancy like Strassen's algorithm. Now, see how much improvement you can get with OpenMP. Measure only the elapsed time for the multiplication, not for the matrix initialization. (A minimal sketch appears after this list.)

    Report these execution times.

    1. Without OpenMP enabled (don't use -fopenmp; comment out the pragmas).
    2. With OpenMP, using only 1 thread.
    3. Using 2, 4, 8, 16, 32, 64 threads.
  2. Write programs to test the effect of the reduction pragma (also sketched after this list):

    1. Create an array of 1,000,000,000 floats (4 GB) and fill it with pseudorandom numbers from 0 to 1.
    2. Do the following tests with 1, 2, 4, 8, 16, and 32 threads.

    Programs to write and test:

    1. Sum it with a simple parallel for loop, leaving the subtotal variable unprotected. The racing updates will give a wrong answer with more than 1 thread, but this is fast.
    2. Sum it with the subtotal variable protected with an atomic pragma.
    3. Sum it with the subtotal variable protected with a critical pragma.
    4. Sum it with a loop using the reduction clause.
  3. Devise a test program to estimate the time to execute a task pragma. You might start with taskfib.cc.

  4. Sometimes parallelizing a program can increase its elapsed time. Try to create such an example, with 2 threads being slower than 1.
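
Here is a minimal sketch of item 1, in C++ with OpenMP. The file name, the initialization values, and the use of std::chrono (so the same source compiles with and without -fopenmp) are my choices, not part of the assignment.

    // matmul.cc -- compile: g++ -O3 -fopenmp matmul.cc
    // For the no-OpenMP test, recompile without -fopenmp.
    #include <chrono>
    #include <cstdio>

    const int N = 100;
    float a[N][N], b[N][N], c[N][N];

    int main() {
        // Initialize the inputs; per the assignment, this is not timed.
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                a[i][j] = i + j;
                b[i][j] = i - j;
            }

        auto start = std::chrono::steady_clock::now();
        #pragma omp parallel for   // inert when compiled without -fopenmp
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                float sum = 0;
                for (int k = 0; k < N; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        printf("multiply took %g sec\n", elapsed.count());
        return 0;
    }

Set the thread count with, e.g., OMP_NUM_THREADS=8 ./a.out. At N=100 one multiplication takes well under a millisecond, so, as with the matrixMulCUBLAS time in section 3 below, you may want to repeat the loop and average.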

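For item 2, the four programs can share everything but the summing loop. A sketch of the four variants, assuming you time each loop with the same steady_clock code as above; the seed, and the option of shrinking N if your machine lacks the 4 GB, are my additions.

    // reduce.cc -- compile: g++ -O3 -fopenmp reduce.cc
    #include <cstdio>
    #include <random>
    #include <vector>

    int main() {
        const long N = 1000L * 1000 * 1000;   // 4 GB of floats; shrink if needed
        std::vector<float> a(N);
        std::mt19937 gen(12345);
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);
        for (long i = 0; i < N; i++) a[i] = dist(gen);

        // (a) Unprotected: the threads race on sum, so the answer is
        // wrong with more than 1 thread, but there is no locking cost.
        double sum = 0;
        #pragma omp parallel for
        for (long i = 0; i < N; i++) sum += a[i];
        printf("unprotected: %f\n", sum);

        // (b) Atomic: correct, but every addition is serialized.
        sum = 0;
        #pragma omp parallel for
        for (long i = 0; i < N; i++) {
            #pragma omp atomic
            sum += a[i];
        }
        printf("atomic: %f\n", sum);

        // (c) Critical: correct; a full lock, so even more overhead.
        sum = 0;
        #pragma omp parallel for
        for (long i = 0; i < N; i++) {
            #pragma omp critical
            sum += a[i];
        }
        printf("critical: %f\n", sum);

        // (d) Reduction: each thread accumulates a private subtotal,
        // combined once at the end. Correct and fast.
        sum = 0;
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < N; i++) sum += a[i];
        printf("reduction: %f\n", sum);
        return 0;
    }

Note that with 2 threads, variants (b) and (c) will likely be slower than with 1 thread, which is one answer to item 4.
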
2   Nvidia conceptual hierarchy

As always, this is as I understand it, and could be wrong. Nvidia uses their own terminology inconsistently. They may use one name for two things (e.g., Tesla and GPU), and may use two names for one thing (e.g., module and accelerator). As time progresses, they change their terminology.

  1. At the bottom is the hardware micro-architecture. This is effectively an interface that defines things like the available operations. The last several Nvidia micro-architecture generations are, in order, Tesla (which introduced unified shaders), Fermi, Kepler, Maxwell (introduced in 2014), Pascal (2016), and Volta (2017).
  2. Each micro-architecture is implemented in several different microprocessors. E.g., the Kepler micro-architecture is embodied in the GK107, GK110, etc. Pascal is GP104 etc. The second letter describes the micro-architecture. Different microprocessors with the same micro-architecture may have different amounts of various resources, like the number of processors and clock rate.
  3. To be used, microprocessors are embedded in graphics cards, aka modules or accelerators, which are grouped into series such as GeForce, Quadro, etc. Confusingly, there is a Tesla computing module that may use any of the Tesla, Fermi, or Kepler micro-architectures. Two different modules using the same microprocessor may have different amounts of memory and other resources. These are the components that you buy and insert into a computer. A typical name is GeForce GTX1080.
  4. There are many slightly different accelerators with the same architecture, but different clock speeds and memory, e.g. 1080, 1070, 1060, ...
  5. The same accelerator may be manufactured by different vendors, as well as by Nvidia. These different versions may have slightly different parameters. Nvidia's reference version may be relatively low performance.
  6. The term GPU sometimes refers to the microprocessor and sometimes to the module.
  7. There are four families of modules: GeForce for gamers, Quadro for professionals, Tesla for computation, and Tegra for mobility.
  8. Nvidia uses the term Tesla in two unrelated ways. It is an obsolete architecture generation and a module family.
  9. Geoxeon has a (Maxwell) GeForce GTX Titan and a (Kepler) Tesla K20xm. Parallel has a (Pascal) GeForce GTX 1080. We also have an unused (Kepler) Quadro K5000.
  10. Since the highest-end (Tesla) modules don't have video out, they are also called something like compute modules.

3   GPU range of speeds

Here is an example of the wide range of Nvidia GPU speeds; all times are ±20%.

The GTX 1080 has 2560 CUDA cores @ 1.73GHz and 8GB of memory. matrixMulCUBLAS runs at 3136 GFlops. However, the reported time (0.063 msec) is so small that it may be inaccurate. The quoted peak speed of the 1080 is about triple that; I'm impressed that the measured performance comes that close to peak.

The Quadro K2100M in my Lenovo W540 laptop has 576 CUDA cores @ 0.67 GHz and 2GB of memory. matrixMulCUBLAS runs at 320 GFlops. The time on the GPU was about 0.7 msec, and on the CPU about 600 msec.

It's nice that the performance almost scaled with the number of cores and clock speed: the 1080 has about 11.5 times the core-clock product of the K2100M (2560 x 1.73 vs. 576 x 0.67), and delivered about 9.8 times the GFlops (3136 vs. 320).

4   CUDA

4.1   Versions

  1. CUDA has a capability version, whose major number corresponds to the micro-architecture generation. Kepler is 3.x. The K20xm is 3.5. The GTX 1080 is 6.1. The CUDA C Programming Guide has a table of the properties of different compute capabilities. However, that table is not completely consistent with what deviceQuery shows, e.g., the shared memory size.

  2. nvcc, the CUDA compiler, can be told which capabilities (aka architectures) to compile for. They can be given as a real architecture, e.g., sm_61, or a virtual architecture, e.g., compute_61.

    Just use the option -arch=compute_61.

  3. The CUDA driver and runtime also have a software version, defining things like available C++ functions. The latest is 9.1. This is unrelated to the capability version.
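
Here is a short CUDA program that makes the two kinds of version visible; the API calls are standard CUDA runtime calls, and the file name is arbitrary.

    // versions.cu -- compile: nvcc versions.cu -o versions
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        int driver, runtime;
        cudaDriverGetVersion(&driver);
        cudaRuntimeGetVersion(&runtime);
        // Capability version: 6.1 on the GTX 1080, 3.5 on the K20xm.
        printf("compute capability %d.%d (%s)\n", prop.major, prop.minor, prop.name);
        // Software versions are encoded as 1000*major + 10*minor; 9010 is 9.1.
        printf("driver %d, runtime %d\n", driver, runtime);
        return 0;
    }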

4.2   Stanford lectures

  1. On the web server.
  2. On geoxeon: /parallel-class/stanford/lectures/
  3. Lecture 4: how to
    1. cache data into shared memory for speed, and
    2. use hierarchical sync.
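
A tiny kernel illustrating both points; this is my sketch, not code from the lecture. Each block stages a 256-float tile in shared memory, syncs, then reads data loaded by other threads in the block.

    #include <cuda_runtime.h>

    // Reverse each 256-element tile of a, staging it in shared memory.
    __global__ void reverseTiles(float *a) {
        __shared__ float tile[256];              // on-chip, programmer-managed cache
        int g = blockIdx.x * 256 + threadIdx.x;  // this thread's global index
        tile[threadIdx.x] = a[g];                // 1. cooperatively cache the tile
        __syncthreads();                         // 2. barrier for this block only --
                                                 //    sync is hierarchical, not global
        a[g] = tile[255 - threadIdx.x];          // 3. safely read a neighbor's load
    }

    int main() {
        const int n = 1 << 20;                   // must be a multiple of 256 here
        float *a;
        cudaMalloc(&a, n * sizeof(float));
        cudaMemset(a, 0, n * sizeof(float));
        reverseTiles<<<n / 256, 256>>>(a);
        cudaDeviceSynchronize();
        cudaFree(a);
        return 0;
    }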

4.3   Misc

  1. With CUDA, the dominant problem in program optimization is optimizing the data flow. Getting the data quickly to the cores is harder than processing it. It helps a lot to have regular arrays, where consecutive threads read or write consecutive entries, so that the hardware can coalesce the accesses (see the sketch after this list).

    This is analogous to the hardware fact that wires are bigger (hence, more expensive) than gates.

  2. That is the opposite of the OpenMP situation, where different threads writing to adjacent addresses causes the false sharing problem (also sketched after this list).

  3. Nvidia CUDA FAQ

    1. has links to other Nvidia docs.
    2. is a little old. Kepler and Fermi are 2 and 3 generations old.
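
To make point 1 concrete, here is a hypothetical pair of kernels whose only difference is the indexing. On a regular array, the coalesced version lets the hardware merge each warp's 32 loads into a few wide memory transactions; the strided version forces a separate transaction per thread.

    #include <cuda_runtime.h>

    // Coalesced: thread i touches a[i], so a warp's 32 accesses hit
    // 32 consecutive floats and the hardware merges them.
    __global__ void scaleCoalesced(float *a, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] *= s;
    }

    // Strided: thread i touches a[32*i], so each thread in a warp lands
    // in a different memory segment and nothing merges. (This only shows
    // the access pattern; it touches 1/32 of the data.)
    __global__ void scaleStrided(float *a, float s, int n) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
        if (i < n) a[i] *= s;
    }

    int main() {
        const int n = 1 << 24;
        float *a;
        cudaMalloc(&a, n * sizeof(float));
        cudaMemset(a, 0, n * sizeof(float));
        scaleCoalesced<<<(n + 255) / 256, 256>>>(a, 2.0f, n);
        scaleStrided<<<(n + 255) / 256, 256>>>(a, 2.0f, n);
        cudaDeviceSynchronize();
        cudaFree(a);
        return 0;
    }

Running this under nvprof will show the per-kernel time difference.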
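
And the OpenMP flip side from point 2, which can also be a starting point for homework item 4: each thread updates its own counter, so there is no race, but the counters share a cache line. The iteration count and the use of volatile (to keep the compiler from hoisting the counter into a register) are my choices.

    // false.cc -- compile: g++ -O3 -fopenmp false.cc
    #include <cstdio>
    #include <omp.h>

    int main() {
        const long iters = 100 * 1000 * 1000;
        volatile long counts[2] = {0, 0};   // adjacent: one 64-byte cache line

        double start = omp_get_wtime();
        #pragma omp parallel num_threads(2) // try num_threads(1) for comparison
        {
            int me = omp_get_thread_num();
            // The line holding counts[] ping-pongs between the two cores on
            // every write; padding each counter out to 64 bytes (e.g., with
            // alignas(64) in a struct) removes the slowdown.
            for (long i = 0; i < iters; i++)
                counts[me] = counts[me] + 1;
        }
        printf("total %ld, elapsed %g sec\n",
               counts[0] + counts[1], omp_get_wtime() - start);
        return 0;
    }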