PAR Class 6, Wed 2018-02-21

1   Narayanaswami Chandrasekhar talk on Blockchains

Class will be short today because of this.

2   Optional Homework - bring your answers and discuss next week

2.1   Paper questions

  1. Research and then describe the main changes from NVidia Maxwell to Pascal.
  2. Although a thread can use 255 registers, that might be bad for performance. Why?
  3. Give a common way that the various threads in a block can share data with each other.
  4. Reading a word from global memory might take 400 cycles. Does that mean that a thread that reads many words from global memory will always take hundreds of times longer to complete?
  5. Since the threads in a warp are executed in a SIMD fashion, how can an if-then-else block be executed?
  6. What is unified virtual addressing and how does it make CUDA programming easier?

2.2   Programming questions

  1. Repeat homework 2's matrix multiplication problem, this time in CUDA. Report how much parallel speedup you get.

  2. Look at the dataset /parallel-class/data/bunny. It contains 35947 points for the Stanford bunny.

    Assuming that each point has a mass of 1, and is gravitationally attracted to the others, compute the potential energy of the system. The formula is this:

    \(U = - \sum_{i=1}^{N-1} \sum_{j=i+1}^N \frac{1}{r_{ij}}\)

    where \(r_{ij}\) is the distance between points \(i\) and \(j\). (This assumes that G=1.) A serial reference sketch appears after this list.

  3. Now look at the dataset /parallel-class/data/blade, which contains 882954 points for a turbine blade. Can you process it?
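
For checking a GPU version on a small subset of the points, here is a minimal serial reference for the energy formula above. It is only a sketch: the point arrays are assumed to be already loaded (reading the dataset files is not shown), and the names are arbitrary.

  #include <cmath>
  #include <cstdio>
  #include <vector>

  // Serial O(N^2) reference for U = -sum_{i<j} 1/r_ij, with unit masses and G = 1.
  double potential_energy(const std::vector<double>& x,
                          const std::vector<double>& y,
                          const std::vector<double>& z) {
      const size_t n = x.size();
      double u = 0.0;
      for (size_t i = 0; i + 1 < n; ++i)
          for (size_t j = i + 1; j < n; ++j) {
              const double dx = x[i] - x[j], dy = y[i] - y[j], dz = z[i] - z[j];
              u -= 1.0 / std::sqrt(dx * dx + dy * dy + dz * dz);
          }
      return u;
  }

  int main() {
      // Hypothetical tiny test: four unit-spaced points on a line.
      std::vector<double> x{0, 1, 2, 3}, y(4, 0.0), z(4, 0.0);
      printf("U = %f\n", potential_energy(x, y, z));
      return 0;
  }

This is O(N^2), so it is tolerable for the bunny but slow for the blade; it is meant as a correctness check on a subset, not as the parallel solution.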

3   Stanford lectures

  1. Lecture 5 performance considerations shows how to fine-tune your program once it's already working, if you need the extra speed.
  2. Lecture 6 parallel patterns 1 presents some paradigms of parallel programming. These are generally useful building blocks for parallel algorithms.

4   Misc CUDA

  1. The demo programs are in /local/cuda/samples/. Their coding style is suboptimal. However, in /local/cuda/samples/1_Utilities/, bandwidthTest and deviceQuery are interesting.

    For your convenience, /parallel-class/deviceQuery is a link. Run it to see the GPU's capabilities.

  2. The program nvidia-smi shows the current load on the GPU.

  3. My web copy of the tutorial programs from Stanford's parallel course notes is also on parallel at /parallel-class/stanford/tutorials/.

    1. I've edited some of them (the originals are in orig/) and created new ones.

    2. To compile them, you need /local/cuda/bin in your PATH and /local/cuda/lib64 in your LD_LIBRARY_PATH.

    3. Name your source program foo.cu for some foo.

    4. Compile it thus: nvcc foo.cu -o foo.

    5. hello_world.cu shows a simple CUDA program and uses a hack to print from a device function.

    6. hello_world2.cu shows printing from several threads.

    7. global_functions.cu shows some basic CUDA stuff.

    8. device_functions.cu extends it.

    9. vector_addition.cu does (you figure it out).

    10. vector_addition2.cu is my modification to use unified memory, per http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html. I also cleaned up the code and shrank the number of lines for better display.

      IMO, unified memory makes programming a lot easier.

      Notes:

      1. In Linux, what's the easiest way to find the smallest prime larger than a given number?

      2. To find the number of blocks needed for N threads (both ways appear in the sketch after this list), you can do it the Stanford way:

        grid_size = num_elements / block_size; if(num_elements % block_size) ++grid_size;

        or you can do it the RPI (i.e., my) way:

        grid_size = (num_elements + block_size - 1) / block_size;
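
Here is a minimal sketch combining the block-count computation with printing from each thread (in the spirit of hello_world2.cu). It is not one of the tutorial files; the kernel and the sizes are made up for illustration. Compile it like the others: nvcc foo.cu -o foo.

  #include <cstdio>

  // Each thread prints its global index; device-side printf needs
  // compute capability 2.0 or higher.
  __global__ void hello(int num_elements) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < num_elements) printf("hello from thread %d\n", i);
  }

  int main() {
      const int num_elements = 1000, block_size = 256;

      // Stanford way:
      int grid_size = num_elements / block_size;
      if (num_elements % block_size) ++grid_size;

      // RPI way (ceiling division); gives the same answer:
      const int grid_size2 = (num_elements + block_size - 1) / block_size;
      printf("grid_size = %d, grid_size2 = %d\n", grid_size, grid_size2);

      hello<<<grid_size, block_size>>>(num_elements);
      cudaDeviceSynchronize();   // wait for the kernel (and its printing) to finish
      return 0;
  }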

5   Managed Variables

Last time we saw two ways to create managed variables. They can be accessed by either the host or the device and are paged automatically. This makes programming much easier; a sketch follows the list below.

  1. Create static variables with __device__ __managed__. See /parallel-class/stanford/tutorials/vector_addition2.cu on parallel.
  2. Use cudaMallocManaged. See /parallel-class/stanford/tutorials/vector_addition3.cu on parallel.
  3. In either case, you need to call cudaDeviceSynchronize(); on the host after starting a parallel kernel before reading the data on the host. The reason is that the kernel is started asynchronously and control returns while it is still executing.
  4. When the Linux kernel gets HMM (heterogeneous memory management), all data on the heap will automatically be managed.
  5. This will work because virtual addresses are long enough to contain a tag saying what device they are on. The VM page mapper will read and write pages to various devices, not just swap files.
  6. Any CUDA example using cudaMemcpy is now obsolete (on Pascal GPUs).
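
Here is a minimal sketch of both methods, with the required synchronization. It is not the actual vector_addition2.cu or vector_addition3.cu; the names and sizes are made up.

  #include <cstdio>

  const int N = 1024;

  // Method 1: a static managed array, accessible from both host and device.
  __device__ __managed__ float a[N];

  __global__ void fill(float *b) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N) {
          a[i] = i;         // write the static managed array
          b[i] = 2.0f * i;  // write the cudaMallocManaged array
      }
  }

  int main() {
      // Method 2: a managed array allocated at run time.
      float *b;
      cudaMallocManaged(&b, N * sizeof(float));

      const int block_size = 256;
      const int grid_size = (N + block_size - 1) / block_size;
      fill<<<grid_size, block_size>>>(b);

      // The kernel launch is asynchronous, so wait before the host reads the data.
      cudaDeviceSynchronize();

      printf("a[10] = %f, b[10] = %f\n", a[10], b[10]);
      cudaFree(b);
      return 0;
  }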

6   Doc

Nvidia's CUDA programming guide is excellent, albeit obsolescent in places. The Pascal info looks like it's been tacked onto an older document.

The whitepaper NVIDIA GeForce GTX 1080 describes, from a gaming point of view, the GP104 GPU, which is in the GTX 1080, the card in parallel.ecse.

NVIDIA now has a higher-end GPU, the P100, described in the P100 whitepaper and P100 technical overview. Note that the P100 is a Tesla (scientific computing) card, not a GeForce (gaming) card. This description is much more technical.

7   Managed memory issues

I'm sometimes seeing a 2.5x slowdown when using managed memory on the host, compared to using unmanaged memory. I don't know why.

8   Misc hints

8.1   Vim

To get vim to show line numbers, create a file ~/.exrc containing this line:

:se nu

It will be read every time vim starts and will set line-number mode (se nu abbreviates set number). If you already have a ~/.vimrc, vim reads that instead of ~/.exrc, so put the line there.