PAR Class 11, Thu 2020-02-20

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-02-19 00:00

Table of contents

1 Term project
2 No class Thu Feb 27
3 Student talks in class, round 2
4 Nvidia conceptual hierarchy
5 GPU range of speeds
6 CUDA
- 6.1 Versions
- 6.2 Misc
7 Nvidia GPU and accelerated computing, ctd.

1 Term project

See the syllabus.
Proposal due March
Progress reports on March 26, April 9 (Thu)
Presentations in class April 20 (Mon), 23 (Thu), 27 (Mon).
Paper etc due on Wed April 29 (last class).

2 No class Thu Feb 27

A group of us will be visiting IBM's Quantum Computing group.

On Monday I'll make suggestions for how to occupy your time.

3 Student talks in class, round 2

Choose a topic (or group of topics) from the 589 on-demand sessions at the GPU Tech Conference.
Summarize it in class. OK to show the whole presentation (or parts of it) with your commentary.
Groups of two people are allowed. However, this time, please keep your talks to 6 min.
Presentation dates: March 2 and 5.
No need to sign up for your topic since there'll probably be few overlaps. However, here's a signup site for your presentation date.

https://doodle.com/poll/hwqrmeyxgsvk2mzk

I've allowed up to 11 talks per day.

Please enter your name and rcsid.

4 Nvidia conceptual hierarchy

As always, this is as I understand it, and could be wrong. Nvidia uses their own terminology inconsistently. They may use one name for two things (e.g., Tesla and GPU), and may use two names for one thing (e.g., module and accelerator). As time progresses, they change their terminology.

At the bottom is the hardware micro-architecture. This is an API that defines things like the available operations. The last several Nvidia micro-architecture generations are, in order, Tesla (which introduced unified shaders), Fermi, Kepler, Maxwell (introduced in 2014), Pascal (2016), Volta (2017), and Turing (2018).
Each micro-architecture is implemented in several different microprocessors. E.g., the Kepler micro-architecture is embodied in the GK107, GK110, etc. Pascal is GP104 etc. The second letter describes the micro-architecture. Different microprocessors with the same micro-architecture may have different amounts of various resources, like the number of processors and clock rate.
To be used, microprocessors are embedded in graphics cards, aka modules or accelerators, which are grouped into series such as GeForce, Quadro, etc. Confusingly, there is a Tesla computing module that may use any of the Tesla, Fermi, or Kepler micro-architectures. Two different modules using the same microprocessor may have different amounts of memory and other resources. These are the components that you buy and insert into a computer. A typical name is GeForce GTX 1080.
There are many slightly different accelerators with the same architecture, but different clock speeds and memory, e.g. 1080, 1070, 1060, ...
The same accelerator may be manufactured by different vendors, as well as by Nvidia. These different versions may have slightly different parameters. Nvidia's reference version may be relatively low performance.
The term GPU sometimes refers to the microprocessor and sometimes to the module.
There are at least four families of modules: GeForce for gamers, Quadro for professionals, Tesla for computation, and Tegra for mobility.
Nvidia uses the term Tesla in two unrelated ways. It is an obsolete architecture generation and a module family.
Geoxeon has a (Maxwell) GeForce GTX Titan and a (Kepler) Tesla K20xm. Parallel has a (Turing) Quadro RTX 8000 and a (Pascal) GeForce GTX 1080. We also have an unused (Kepler) Quadro K5000.
Since the highest-end (Tesla) modules don't have video out, they are also called something like compute modules.
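
The chip naming convention described above (the second letter of a code like GK110 gives the micro-architecture) can be sketched in a few lines of Python. This is only an illustration: the function name and the lookup table are my own, the notes confirm only the K and P entries, and the other letters are filled in from the usual chip names (GF, GM, GV). The older Tesla-generation chips do not follow the pattern cleanly, so they are left out.

```python
# Second letter of an Nvidia chip code -> micro-architecture name.
# Assumption: only the K and P entries come from the notes; F, M, V are
# added from the usual chip names (e.g., GF100, GM204, GV100).
ARCH_BY_LETTER = {
    "F": "Fermi",
    "K": "Kepler",
    "M": "Maxwell",
    "P": "Pascal",
    "V": "Volta",
}


def micro_architecture(chip_code):
    """Return the micro-architecture for a chip code such as 'GK110'."""
    code = chip_code.strip().upper()
    if len(code) < 2 or code[0] != "G":
        raise ValueError(f"unexpected chip code: {chip_code!r}")
    try:
        return ARCH_BY_LETTER[code[1]]
    except KeyError:
        raise ValueError(f"unknown architecture letter in {chip_code!r}") from None


for chip in ["GK107", "GK110", "GP104"]:
    print(chip, "->", micro_architecture(chip))
```

Note that this maps only the microprocessor name. The module name (e.g., GeForce GTX 1080) has to be looked up separately, since several modules can share one microprocessor.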

5 GPU range of speeds

Here is an example of the wide range of Nvidia GPU speeds; all times are ±20%.

The Quadro RTX 8000 has 4608 CUDA cores @ 1.77 GHz and 48 GB of memory. matrixMulCUBLAS runs at 5310 GFlops. The specs claim 16 TFlops. However, those numbers understate its capabilities because it also has 576 Tensor cores and 72 ray tracing cores to cast 11G rays/sec.

The GeForce GTX 1080 has 2560 CUDA cores @ 1.73 GHz and 8 GB of memory. matrixMulCUBLAS runs at 3136 GFlops. However, the reported time (0.063 msec) is so small that it may be inaccurate. The quoted speed of the 1080 is about triple that. I'm impressed that the measured performance is so close.

The Quadro K2100M in my Lenovo W540 laptop has 576 CUDA cores @ 0.67 GHz and 2 GB of memory. matrixMulCUBLAS runs at 320 GFlops. The time on the GPU was about 0.7 msec, and on the CPU 600 msec.

It's nice that the performance almost scaled with the number of cores and clock speed.
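
As a rough check on the numbers above, here is a small Python sketch that estimates peak single-precision speed as cores × clock × 2. The factor of 2 is my assumption (one fused multiply-add, counted as two floating-point operations, per core per cycle); the core counts, clocks, and measured GFlops are the ones quoted above. The function and variable names are just for illustration.

```python
# (name, CUDA cores, clock in GHz, measured matrixMulCUBLAS GFlops), from the notes.
GPUS = [
    ("Quadro RTX 8000", 4608, 1.77, 5310),
    ("GeForce GTX 1080", 2560, 1.73, 3136),
    ("Quadro K2100M", 576, 0.67, 320),
]

# Assumption: each core retires one fused multiply-add (2 flops) per cycle.
FLOPS_PER_CORE_PER_CYCLE = 2


def peak_gflops(cores, clock_ghz):
    """Estimated peak single-precision GFlops for a given core count and clock."""
    return cores * clock_ghz * FLOPS_PER_CORE_PER_CYCLE


for name, cores, clock, measured in GPUS:
    peak = peak_gflops(cores, clock)
    print(f"{name}: peak {peak:.0f} GFlops, measured {measured}, "
          f"ratio {measured / peak:.2f}")
```

Under that assumption the RTX 8000 estimate comes to about 16.3 TFlops, in line with the 16 TFlops that the specs claim, and the GTX 1080 estimate is a bit under triple its measured speed.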

6 CUDA

6.1 Versions

CUDA has a capability version, whose major number corresponds to the micro-architecture generation. Kepler is 3.x. The K20xm is 3.5. The GTX 1080 is 6.1. The RTX 8000 is 7.5. Here is a table of the properties of different compute capabilities. However, that table is not completely consistent with what deviceQuery shows, e.g., the shared memory size.
nvcc, the CUDA compiler, can be told which capabilities (aka architectures) to compile for. They can be given as a real architecture, e.g., sm_61, or a virtual architecture, e.g., compute_61.
The CUDA driver and runtime also have a software version, defining things like available C++ functions. The latest is 10.1. This is unrelated to the capability version.
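
To make the naming concrete, here is a small Python sketch that turns a capability version such as 6.1 into the matching virtual and real architecture names and into an nvcc -gencode option. The helper names are hypothetical; the -gencode arch=...,code=... form is the standard nvcc option, but check the nvcc documentation for your CUDA version before relying on it.

```python
def arch_names(capability):
    """Map a capability such as '6.1' to (virtual, real) architecture names."""
    major, minor = capability.split(".")
    digits = f"{int(major)}{int(minor)}"
    return f"compute_{digits}", f"sm_{digits}"


def gencode_flag(capability):
    """Build an nvcc -gencode option targeting one capability."""
    virtual, real = arch_names(capability)
    return f"-gencode arch={virtual},code={real}"


# The capabilities of the K20xm, GTX 1080, and RTX 8000 mentioned above.
for cap in ["3.5", "6.1", "7.5"]:
    print(cap, "->", gencode_flag(cap))
```

Several such options can be passed to one nvcc command to build a binary that runs on more than one of the machines above.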

6.2 Misc

With CUDA, the dominant problem in program optimization is optimizing the data flow. Getting the data quickly to the cores is harder than processing it. It helps greatly to have regular arrays, where each core reads or writes a successive entry.

This is analogous to the hardware fact that wires are bigger (hence, more expensive) than gates.
This is the opposite of the optimization for OpenMP, where having different threads write to adjacent addresses causes the false sharing problem.
Nvidia CUDA FAQ
1. has links to other Nvidia docs.
2. can be a little old.
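
To illustrate the earlier point about regular arrays, where each core reads a successive entry, here is a simplified Python model. It counts how many distinct memory segments the 32 threads of a warp touch when thread i reads array element i × stride. The 128-byte segment size and the one-transaction-per-segment rule are my simplifying assumptions, not a description of any particular GPU; real hardware is more complicated.

```python
WARP_SIZE = 32        # threads per warp on Nvidia GPUs
SEGMENT_BYTES = 128   # assumption: size of one memory transaction
ELEMENT_BYTES = 4     # one 32-bit float


def segments_touched(stride, base=0):
    """Count distinct segments read when thread i reads element base + i*stride."""
    addresses = [(base + i * stride) * ELEMENT_BYTES for i in range(WARP_SIZE)]
    return len({addr // SEGMENT_BYTES for addr in addresses})


for stride in [1, 2, 8, 32]:
    print(f"stride {stride:2d}: {segments_touched(stride)} segment(s)")
```

In this model, successive entries (stride 1) need one segment for the whole warp, while a stride of 32 needs a separate segment for every thread, so the same amount of useful data costs 32 times the memory traffic.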

7 Nvidia GPU and accelerated computing, ctd.

This material accompanies Programming Massively Parallel Processors: A Hands-on Approach, Third Edition, by David B. Kirk and Wen-mei W. Hwu. I recommend it. (The slides etc are free but the book isn't.)

Continuing /parallel-class/GPU-Teaching-Kit-Accelerated-Computing, starting at Module 17, slide 14.