PAR Class 8, Thurs 2019-02-07

1   Nvidia device architecture, CUDA ctd

  1. The above shared memory model hits a wall; CUDA handles the other side of the wall.
  2. See Stanford's parallel course notes, which they've made freely available. (Thanks.)
    1. They're part of Nvidia's Academic programs site.
    2. They are very well done.
    3. They are obsolete in some places, which I'll mention.
    4. However, a lot of newer CUDA material is also obsolete.
    5. Nevertheless, the principles are still mostly valid.

2   Nvidia conceptual hierarchy

As always, this is as I understand it, and could be wrong. Nvidia uses their own terminology inconsistently. They may use one name for two things (e.g., Tesla and GPU), and may use two names for one thing (e.g., module and accelerator). As time progresses, they change their terminology.

  1. At the bottom is the hardware micro-architecture. This is an API that defines things like the available operations. The last several Nvidia micro-architecture generations are, in order, Tesla (which introduced unified shaders), Fermi, Kepler, Maxwell (introduced in 2014), Pascal (2016), and Volta (2017).
  2. Each micro-architecture is implemented in several different microprocessors. E.g., the Kepler micro-architecture is embodied in the GK107, GK110, etc. Pascal is GP104 etc. The second letter describes the micro-architecture. Different microprocessors with the same micro-architecture may have different amounts of various resources, like the number of processors and clock rate.
  3. To be used, microprocessors are embedded in graphics cards, aka modules or accelerators, which are grouped into series such as GeForce, Quadro, etc. Confusingly, there is a Tesla computing module that may use any of the Tesla, Fermi, or Kepler micro-architectures. Two different modules using the same microprocessor may have different amounts of memory and other resources. These are the components that you buy and insert into a computer. A typical name is GeForce GTX 1080.
  4. There are many slightly different accelerators with the same architecture, but different clock speeds and memory, e.g. 1080, 1070, 1060, ...
  5. The same accelerator may be manufactured by different vendors, as well as by Nvidia. These different versions may have slightly different parameters. Nvidia's reference version may be relatively low performance.
  6. The term GPU sometimes refers to the microprocessor and sometimes to the module.
  7. There are four families of modules: GeForce for gamers, Quadro for professionals, Tesla for computation, and Tegra for mobile and embedded devices.
  8. Nvidia uses the term Tesla in two unrelated ways. It is an obsolete architecture generation and a module family.
  9. Geoxeon has a (Maxwell) GeForce GTX Titan and a (Kepler) Tesla K20xm. Parallel has a (Pascal) GeForce GTX 1080. We also have an unused (Kepler) Quadro K5000.
  10. Since the highest-end (Tesla) modules don't have video out, they are also called something like compute modules.

3   GPU range of speeds

Here is an example of the wide range of Nvidia GPU speeds; all times are ±20%.

The GTX 1080 has 2560 CUDA cores @ 1.73GHz and 8GB of memory. matrixMulCUBLAS runs at 3136 GFlops. However, the reported time (0.063 msec) is so small that it may be inaccurate. The quoted peak speed of the 1080 is about triple that; I'm impressed that the measured performance gets that close to the theoretical peak.

The Quadro K2100M in my Lenovo W540 laptop has 576 CUDA cores @ 0.67 GHz and 2GB of memory. matrixMulCUBLAS runs at 320 GFlops. The time on the GPU was about 0.7 msec, and on the CPU 600 msec.

It's nice that the performance scaled almost linearly with the product of core count and clock speed.
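
For concreteness, here is a minimal sketch of how such a time and GFlops number can be measured with CUDA events. It is not the actual matrixMulCUBLAS source; the matrix size, repetition count, and file name are arbitrary placeholders, and the matrices are left uninitialized since only the timing matters:

    // Sketch: time a cuBLAS matrix multiply with CUDA events and report GFlops.
    // Compile with something like: nvcc time_sgemm.cu -lcublas -o time_sgemm
    #include <cstdio>
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    int main() {
        const int n = 1024;                      // matrix dimension (arbitrary)
        float *a, *b, *c;
        cudaMalloc(&a, n * n * sizeof(float));
        cudaMalloc(&b, n * n * sizeof(float));
        cudaMalloc(&c, n * n * sizeof(float));

        cublasHandle_t h;
        cublasCreate(&h);
        const float alpha = 1.0f, beta = 0.0f;

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        const int reps = 30;                     // one run is too short to time reliably
        cudaEventRecord(start);
        for (int i = 0; i < reps; i++)
            cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                        &alpha, a, n, b, n, &beta, c, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms;
        cudaEventElapsedTime(&ms, start, stop);  // total elapsed GPU time in msec
        double flops = 2.0 * n * n * n * reps;   // ~2 n^3 flops per multiply
        printf("%.3f msec per multiply, %.1f GFlops\n", ms / reps, flops / (ms * 1e6));

        cublasDestroy(h);
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }

Averaging over many repetitions is one way around the problem that a single 0.06 msec run is near the resolution of the timer.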

4   CUDA

4.1   Versions

  1. CUDA has a capability version, whose major number corresponds to the micro-architecture generation. Kepler is 3.x. The K20xm is 3.5. The GTX 1080 is 6.1. Here is a table of the properties of different compute capabilities. However, that table is not completely consistent with what deviceQuery shows, e.g., the shared memory size.

  2. nvcc, the CUDA compiler, can be told which capabilities (aka architectures) to compile for. They can be given as a real architecture, e.g., sm_61, or a virtual architecture, e.g., compute_61.

    Just use the option -arch=compute_61. (A small example appears after this list.)

  3. The CUDA driver and runtime also have a software version, defining things like available C++ functions. The latest is 9.1. This is unrelated to the capability version.
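
For reference, here is a small sketch of how a program can query the capability and related properties itself; this is roughly what deviceQuery does, using the standard fields of cudaDeviceProp:

    // Sketch: print each GPU's compute capability and per-block shared memory,
    // roughly what deviceQuery reports.
    // Compile with, e.g.:  nvcc -arch=compute_61 query.cu -o query
    // (query.cu is just a placeholder file name.)
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int n;
        cudaGetDeviceCount(&n);
        for (int d = 0; d < n; d++) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, d);
            printf("Device %d: %s, capability %d.%d, %zu KB shared memory per block, %d SMs\n",
                   d, p.name, p.major, p.minor, p.sharedMemPerBlock / 1024,
                   p.multiProcessorCount);
        }
        return 0;
    }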

4.2   Stanford lectures

  1. On the web server.
  2. On geoxeon: /parallel-class/stanford/lectures/
  3. Lecture 4: how to
    1. cache data into shared memory for speed, and
    2. use hierarchical sync.
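
To give the flavor of those two ideas, here is a minimal sketch (not taken from the Stanford slides) of a kernel that caches a block's slice of the input in shared memory and uses __syncthreads(), the block-level barrier, before the threads read each other's data; it computes a simple 3-point moving average:

    // Sketch: each block stages its slice of the input (plus a one-element halo
    // on each side) into fast shared memory, synchronizes, then every thread
    // averages itself and its neighbors from the tile instead of from global memory.
    __global__ void smooth(const float *in, float *out, int n) {
        __shared__ float tile[256 + 2];          // assumes blockDim.x == 256; +2 for the halo
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        if (i < n) {
            tile[threadIdx.x + 1] = in[i];       // stage my element
            if (threadIdx.x == 0)                // edge threads also load the halo cells
                tile[0] = (i > 0) ? in[i - 1] : in[i];
            if (threadIdx.x == blockDim.x - 1 || i == n - 1)
                tile[threadIdx.x + 2] = (i < n - 1) ? in[i + 1] : in[i];
        }

        __syncthreads();                         // barrier: every thread now sees the finished tile

        if (i < n)
            out[i] = (tile[threadIdx.x] + tile[threadIdx.x + 1]
                      + tile[threadIdx.x + 2]) / 3.0f;
    }

A launch matching the assumed block size would be smooth<<<(n + 255) / 256, 256>>>(d_in, d_out, n), with d_in and d_out already allocated on the device. Note that every thread of the block, even one with nothing to compute, must reach the __syncthreads().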

4.3   Misc

  1. With CUDA, the dominant problem in program optimization is optimizing the data flow. Getting the data quickly to the cores is harder than processing it. It helps a lot to have regular arrays, where consecutive threads read or write consecutive entries (coalesced access); see the sketch at the end of this section.

    This is analogous to the hardware fact that wires are bigger (hence, more expensive) than gates.

  2. That is the opposite of the optimal layout for OpenMP, where having different threads write to adjacent addresses causes the false sharing problem.

  3. Nvidia CUDA FAQ

    1. has links to other Nvidia docs.
    2. is a little old. Kepler and Fermi are 2 and 3 generations old.
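
Here is the sketch promised in point 1 above: two versions of a trivial kernel that scales an array. In the first, consecutive threads touch consecutive entries, so a warp's accesses coalesce into a few wide memory transactions. The second gives each thread its own contiguous chunk, the layout that suits OpenMP on a CPU (it avoids false sharing), but on a GPU it scatters each warp's accesses and is much slower:

    // Coalesced (fast on a GPU): at every loop iteration the threads of a warp
    // read and write one contiguous block of memory.
    __global__ void scale_coalesced(float *a, int n, float s) {
        int stride = blockDim.x * gridDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            a[i] *= s;                           // thread k handles elements k, k+stride, ...
    }

    // Chunked (the natural OpenMP decomposition, slow on a GPU): consecutive
    // threads are a whole chunk apart in memory, so their accesses don't coalesce.
    __global__ void scale_chunked(float *a, int n, float s) {
        int nthreads = blockDim.x * gridDim.x;
        int chunk = (n + nthreads - 1) / nthreads;
        int start = (blockIdx.x * blockDim.x + threadIdx.x) * chunk;
        for (int i = start; i < start + chunk && i < n; i++)
            a[i] *= s;                           // thread k handles k*chunk .. (k+1)*chunk - 1
    }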