PAR Lecture 8, Mon Feb 13

1   My web site

Moved to my own virtual server (Thanks).

This improves security.

Old links redirect.

http: redirects to https: .

You should see no difference; please report any problems.

I'm creating most new pages with the static site generator Nikola.

  1. The big reason is that the pmwiki project is not as dynamic as it was when I picked it. Its cookbooks, which I use a lot, are not being updated as php is updated.
  2. Also, static site generators, which did not exist when I picked pmwiki, are a cool idea.
  3. Finally, Nikola is more powerful for most purposes and looks better.
  4. Nevertheless, I'll continue to use pmwiki for collaborative blogs, which Nikola doesn't do. (My students report results to me using pmwiki blogs. They are an online version of lab notebooks. I highly recommend this.)

2   CUDA ideas

  1. With CUDA, the dominant problem in program optimization is optimizing the data flow. Getting the data quickly to the cores is harder than processing it. It helps a lot to have regular arrays, where consecutive cores read or write consecutive entries (see the sketch after this list).

    This is analogous to the hardware fact that wires are bigger (hence, more expensive) than gates.

  2. That is the opposite of the optimal pattern for OpenMP, where different threads writing to adjacent addresses causes the false-sharing problem.
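
For instance, here is a minimal sketch (my own names, not one of the course files) of a CUDA kernel in which consecutive threads touch consecutive array entries, which is the access pattern the GPU memory system handles best:

    // Scale an array; thread i handles entry i, so the threads of a warp
    // read and write adjacent addresses and the accesses coalesce.
    __global__ void scale(float *a, int n, float s) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) a[i] *= s;
    }

With OpenMP on a CPU, by contrast, you would rather give each thread its own contiguous chunk, so that two threads never write into the same cache line.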

3   CUDA info on the web

  1. https://developer.nvidia.com/cuda-faq
    1. has links to other Nvidia docs.
    2. is a little old. Kepler and Fermi are 2 and 3 generations old.

4   CUDA programs

  1. parallel.ecse.rpi.edu now has CUDA 8.

  2. The demo programs are in /local/cuda/samples/ . Their coding style is suboptimal. However, in /local/cuda/samples/1_Utilities/ , bandwidthTest and deviceQuery are interesting.

    For your convenience, /parallel-class/deviceQuery is a link. Run it to see the GPU's capabilities.

  3. I will try to use parallel for CUDA because it has a Pascal GPU; geoxeon has a Maxwell and a Kepler GPU.

  4. The program nvidia-smi shows the current load on the GPU.

  5. My web copy of the tutorial programs from Stanford's parallel course notes is also on parallel at /parallel-class/stanford/tutorials/ .

    1. I've edited some of them, put the originals in orig/ , and created new ones.

    2. To compile them, you need /local/cuda/bin in your PATH and /local/cuda/lib64 in your LD_LIBRARY_PATH .

    3. Name your source program foo.cu for some foo .

    4. Compile it thus: nvcc foo.cu -o foo .

    5. hello_world.cu shows a simple CUDA program and uses a hack to print from a device function.

    6. hello_world2.cu shows printing from several threads.

    7. global_functions.cu shows some basic CUDA stuff.

    8. device_functions.cu extends it.

    9. vector_addition.cu does (you figure it out).

    10. vector_addition2.cu is my modification to use unified memory, per http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html . I also cleaned up the code and shrank the number of lines for better display.

      IMO, unified memory makes programming a lot easier.

      Notes:

      1. In linux, what's the easiest way to find the smallest prime larger than a given number?

      2. To find the number of blocks needed for N threads, you can do it the Stanford way:

        grid_size = num_elements / block_size; if(num_elements % block_size) ++grid_size;

        or you can do it the RPI (i.e., my) way, which the sketch after this list uses:

        grid_size = (num_elements + block_size - 1) / block_size;
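
Here is a minimal sketch (mine, not the actual vector_addition2.cu) of the same two ideas: a statically declared managed array, and the RPI way of rounding up the grid size:

    #include <cstdio>

    const int N = 1000;
    __device__ __managed__ float a[N], b[N], c[N];   // accessible from host and device

    __global__ void add(int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
      for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2*i; }     // host writes managed data
      const int block_size = 256;
      const int grid_size = (N + block_size - 1) / block_size;  // round up
      add<<<grid_size, block_size>>>(N);
      cudaDeviceSynchronize();             // wait for the kernel before reading c
      printf("c[%d] = %g\n", N-1, c[N-1]);
      return 0;
    }

Since __managed__ needs compute capability 3.0 or later, compile with an -arch option (see the CUDA-versions section below), e.g., nvcc -arch=compute_61 foo.cu -o foo .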

5   Nvidia conceptual hierarchy

As always, this is as I understand it, and could be wrong. Nvidia uses their own terminology inconsistently. They may use one name for two things (e.g., Tesla and GPU), and may use two names for one thing (e.g., module and accelerator). As time progresses, they change their terminology.

  1. At the bottom is the hardware micro-architecture. This is an API that defines things like the available operations. The last several Nvidia micro-architecture generations are, in order, Tesla (which introduced unified shaders), Fermi, Kepler, Maxwell (introduced in 2014), Pascal (2016), and Volta (future).
  2. Each micro-architecture is implemented in several different microprocessors. E.g., the Kepler micro-architecture is embodied in the GK107, GK110, etc. Pascal is GP104 etc. The second letter describes the micro-architecture. Different microprocessors with the same micro-architecture may have different amounts of various resources, like the number of processors and clock rate.
  3. To be used, microprocessors are embedded in graphics cards, aka modules or accelerators, which are grouped into series such as GeForce, Quadro, etc. Confusingly, there is a Tesla computing module that may use any of the Tesla, Fermi, or Kepler micro-architectures. Two different modules using the same microprocessor may have different amounts of memory and other resources. These are the components that you buy and insert into a computer. A typical name is GeForce GTX1080.
  4. There are many slightly different accelerators with the same architecture, but different clock speeds and memory, e.g. 1080, 1070, 1060, ...
  5. The same accelerator may be manufactured by different vendors, as well as by Nvidia. These different versions may have slightly different parameters. Nvidia's reference version may be relatively low performance.
  6. The term GPU sometimes refers to the microprocessor and sometimes to the module.
  7. There are four families of modules: GeForce for gamers, Quadro for professionals, Tesla for computation, and Tegra for mobility.
  8. Nvidia uses the term Tesla in two unrelated ways. It is an obsolete architecture generation and a module family.
  9. Geoxeon has a (Maxwell) GeForce GTX Titan and a (Kepler) Tesla K20xm. Parallel has a (Pascal) GeForce GTX 1080. We also have an unused (Kepler) Quadro K5000.
  10. Since the highest-end (Tesla) modules don't have video out, they are also called something like compute modules.

6   CUDA versions

  1. CUDA has a capability version, whose major number corresponds to the micro-architecture generation. Kepler is 3.x. The K20xm is 3.5. The GTX 1080 is 6.1. There is a table of the properties of the different compute capabilities in the CUDA documentation. However, that table is not completely consistent with what deviceQuery shows, e.g., the shared memory size. (A sketch of querying these properties at run time follows this list.)

  2. nvcc, the CUDA compiler, can be told which capabilities (aka architectures) to compile for. They can be given as a real architecture, e.g., sm_61, or a virtual architecture, e.g., compute_61.

    Just use the option -arch=compute_61.

  3. The CUDA driver and runtime also have a software version, defining things like available C++ functions. The latest is 8.0. This is unrelated to the capability version.
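
To see the capability version (and the other properties) from inside a program, here is a sketch of roughly what deviceQuery does, using the standard runtime calls cudaGetDeviceCount and cudaGetDeviceProperties:

    #include <cstdio>

    int main() {
      int n;
      cudaGetDeviceCount(&n);
      for (int d = 0; d < n; d++) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        // p.major and p.minor form the compute capability, e.g., 6.1 on the GTX 1080.
        printf("device %d: %s, capability %d.%d, %d multiprocessors\n",
               d, p.name, p.major, p.minor, p.multiProcessorCount);
      }
      return 0;
    }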

7   GPU range of speeds

Here is an example of the wide range of Nvidia GPU speeds; all times are +-20%.

The GTX 1080 has 2560 CUDA cores @ 1.73GHz and 8GB of memory. matrixMulCUBLAS runs at 3136 GFlops. However, the reported time (0.063 msec) is so small that it may be inaccurate. The quoted speed of the 1080 is about triple that. I'm impressed that the measured performance is so close.

The Quadro K2100M in one of my Lenovo W540 laptops has 576 CUDA cores @ 0.67 GHz and 2GB of memory. matrixMulCUBLAS runs at 320 GFlops. The time on the GPU was about .7 msec, and on the CPU 600 msec.

It's nice that the performance almost scaled with the number of cores and clock speed.
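
Roughly: cores × clock is 2560 × 1.73 ≈ 4400 for the 1080 versus 576 × 0.67 ≈ 390 for the K2100M, a ratio of about 11, while the measured GFlops ratio is 3136 / 320 ≈ 10.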

8   CUDA

Last time we saw 2 ways to create managed variables. They can be accessed by either the host or the device and are paged automatically. This makes programming much easier.

  1. Create static variables with __device__ __managed__. See /parallel-class/stanford/tutorials/vector_addition2.cu on parallel.
  2. Use cudaMallocManaged. See /parallel-class/stanford/tutorials/vector_addition3.cu on parallel.
  3. In either case, you need to call cudaDeviceSynchronize() on the host after starting a kernel and before reading the data on the host. The reason is that the kernel is started asynchronously and control returns to the host while it is still executing. (See the sketch after this list.)
  4. When the linux kernel gets HMM (heterogeneous memory management), all data on the heap will automatically be managed.
  5. The reason is that virtual addresses are long enough to contain a tag saying what device they are on. The VM page mapper will read and write pages to various devices, not just swap files.
  6. Any CUDA example using cudaMemcpy is now obsolete (on Pascal GPUs).
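
Here is a minimal sketch (mine, not the actual vector_addition3.cu) of the cudaMallocManaged way, with the required cudaDeviceSynchronize():

    #include <cstdio>

    __global__ void add(const float *a, const float *b, float *c, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
      const int n = 1000;
      float *a, *b, *c;
      // One allocation per array; the same pointer works on the host and the device.
      cudaMallocManaged(&a, n * sizeof(float));
      cudaMallocManaged(&b, n * sizeof(float));
      cudaMallocManaged(&c, n * sizeof(float));
      for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2*i; }
      const int block_size = 256;
      add<<<(n + block_size - 1) / block_size, block_size>>>(a, b, c, n);
      cudaDeviceSynchronize();   // the launch returns immediately; wait before reading c
      printf("c[%d] = %g\n", n-1, c[n-1]);
      cudaFree(a); cudaFree(b); cudaFree(c);
      return 0;
    }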
