# PAR Lecture 14, Mon Mar 6

## 1   Web site

1. I added Disqus comments. This is a way for you to ask questions and discuss things with other class members.
2. I revised the tag categories. Now homeworks and lectures can be listed separately.

## 2   parallel.ecse hardware details

I put the invoice on parallel.ecse in /parallel-class/. It gives the hardware specifics.

## 3   Term project

Summary in syllabus.

## 4   Student presentations about GTC

1. JA, Shaping Light with GPUs
2. LS, High-Level GPU Programming Using OpenMP 4.5 and Clang/LLVM
3. SKL, GPU-Accelerated Motion Input Analysis for Motion Reconstruction
4. WB
5. BC
6. FC
7. AF
8. AG
9. BL
10. TL

## 5   Nvidia GPU summary

Here's a summary of the Nvidia Pascal GP104 GPU architecture as I understand it. It's more compact than I've found elsewhere. I'll add to it from time to time. Some numbers are probably wrong.

1. The host is the CPU.

2. The device is the GPU.

3. The device contains 20 streaming multiprocessors (SMs).

Different GPU generations have used the terms SMX or SMM.

4. A thread is a sequential program with private and shared memory, program counter, etc.

5. Threads are grouped, 32 at a time, into warps.

6. Warps of threads are grouped into blocks.

Often the warps are only implicit, and we consider that the threads are grouped directly into blocks.

That abstraction hides details that may be important; see below.

7. Blocks of threads are grouped into a grid, which is all the threads in the kernel.

8. A kernel is a parallel program executing on the device.

1. The kernel runs potentially thousands of threads.
2. A kernel can create other kernels and wait for their completion.
3. There may be a limit, e.g., 5 seconds, on a kernel's run time.
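
The hierarchy above can be sketched in CUDA C. This is a minimal illustrative example, not from the lecture; the names (addOne, n) are made up, and it assumes a CUDA-capable device with unified memory:

```cuda
#include <cstdio>

// One thread per array element; the grid covers the whole array.
__global__ void addOne(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) a[i] += 1.0f;        // guard: the grid may overshoot n
}

int main() {
    const int n = 1 << 20;
    float *a;
    cudaMallocManaged(&a, n * sizeof(float));       // unified memory
    for (int i = 0; i < n; ++i) a[i] = 0.0f;

    int threadsPerBlock = 256;                      // 8 warps per block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    addOne<<<blocks, threadsPerBlock>>>(a, n);      // launch the grid (the kernel)
    cudaDeviceSynchronize();                        // wait for the kernel to finish

    printf("a[0] = %f\n", a[0]);
    cudaFree(a);
}
```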

9. Thread-level resources:

1. Each thread can use up to 255 fast registers. Registers are private to the thread.

All the threads in one block have their registers allocated from a fixed pool of 65536 registers. The more registers that each thread uses, the fewer warps in the block can run simultaneously.

2. Each thread has up to 512KB of slow local memory, allocated from the global memory.

3. Local memory is used when not enough registers are available, and to store thread-local arrays.

10. Warp-level resources:

1. Threads are grouped, 32 at a time, into warps.

2. Each warp executes as a SIMD, with one instruction register. At each cycle, every thread in a warp is either executing the same instruction, or is disabled. If the 32 threads want to execute 32 different instructions, then they will execute one after the other, sequentially.

If you read in some NVidia doc that threads in a warp run independently, then continue reading the next page to get the info mentioned in the previous paragraph.

3. If successive instructions in a warp do not depend on each other, then, if there are enough warp schedulers available, they may be executed in parallel. This is called Instruction Level Parallelism (ILP).

4. For an array in local memory, which means that each thread will have its private copy, the elements for all the threads in a warp are interleaved to potentially increase the I/O rate.

Therefore your program should try to have successive threads read successive words of arrays.

5. A thread can read variables from other threads in the same warp, with the shuffle instruction. Typical operations are to read from the K-th next thread, to do a butterfly permutation, or to do an indexed read. This happens in parallel for the whole warp, and does not use shared memory.

6. A warp vote combines a bit computed by each thread to report results like all or any.
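
A sketch of shuffle and vote, assuming the CUDA 9+ `_sync` intrinsics and a launch of exactly one 32-thread warp (the kernel name and arguments are illustrative):

```cuda
// Warp-level sum via __shfl_down_sync, then a warp vote.
__global__ void warpDemo(const int *in, int *sum, int *allPositive) {
    int lane = threadIdx.x & 31;        // lane id within the warp
    int v = in[threadIdx.x];

    // Tree reduction: each step adds the value from lane + offset.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);

    // Warp vote: did every thread in the warp start with a positive value?
    int all = __all_sync(0xffffffff, in[threadIdx.x] > 0);

    if (lane == 0) { *sum = v; *allPositive = all; }  // lane 0 holds the total
}
```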

11. Block-level resources:

1. A block may contain up to 1024 threads.

2. Each block has access to 65536 fast 32-bit registers, for the use of its threads.

3. Each block can use up to 49152 bytes of the SM's fast shared memory. The block's shared memory is shared by all the threads in the block, but is hidden from other blocks.

Shared memory is basically a user-controllable cache of some global data. The saving comes from reusing that shared data several times after you loaded it from global memory once.

Shared memory is interleaved in banks so that some access patterns are faster than others.

4. Warps in a block run asynchronously and run different instructions. They are scheduled and executed as resources are available.

5. The threads in a block can be synchronized with __syncthreads().

Because of how warps are scheduled, that can be slow.

6. The threads in a block can be arranged into a 3D array, up to 1024x1024x64.

That is for convenience, and does not increase performance (I think).

7. I'll talk about textures later.
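
Shared memory plus __syncthreads() can be sketched like this (a toy kernel that reverses one tile per block; the tile size and names are illustrative, and it assumes blocks of exactly TILE threads):

```cuda
#define TILE 256

// Stage a tile of global memory in fast shared memory, then reuse it.
__global__ void reverseTile(float *d) {
    __shared__ float s[TILE];           // one tile per block, in the SM's shared memory
    int t = threadIdx.x;
    s[t] = d[blockIdx.x * TILE + t];    // each thread loads one word
    __syncthreads();                    // all the block's warps must finish loading
    d[blockIdx.x * TILE + t] = s[TILE - 1 - t];  // read back from shared memory
}
```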

12. Streaming Multiprocessor (SM) - level resources:

1. Each SM has 128 single-precision CUDA cores, 64 double-precision units, 32 special function units, and 32 load/store units.

2. In total, the GPU has 2560 CUDA cores.

3. A CUDA core is akin to an ALU. The cores, and all the units, are pipelined.

4. A CUDA core is much less powerful than one core of an Intel Xeon. My guess is 1/20th.

5. Beware that, in the CUDA C Programming Guide, NVidia sometimes calls an SM a core.

6. The limited number of, e.g., double-precision units means that a DP instruction will need to be scheduled several times for all the threads to execute it. That's why DP is slower.

7. Each SM has 4 warp schedulers and 8 instruction dispatch units.

8. 64 warps can simultaneously reside in an SM.

9. Therefore up to 32x64=2048 threads can be resident on an SM at once.

10. Up to 16 blocks can simultaneously be resident in an SM.

However, if each block uses too many resources, like shared memory, then this number is reduced.

Each block sits on only one SM; no block is split. However a block's warps are executed asynchronously (until synced).

11. Each SM has 64KiB (?) fast memory to be divided between shared memory and an L1 cache. Typically, 48KiB (96?) is used for the shared memory, to be divided among its resident blocks, but that can be changed.

12. The 48KB L1 cache can cache local or global memory.

13. Each SM has a read-only data cache of 48KB to cache the global constant memory.

14. Each SM has 8 texture units, and many other graphics capabilities.

15. Each SM has 256KB of L2 cache.

13. Grid-level resources:

1. The blocks in a grid can be arranged into a 3D array, up to $(2^{31}-1, 2^{16}-1, 2^{16}-1)$.
2. Blocks in a grid might run on different SMs.
3. Blocks in a grid are queued and executed as resources are available, in an unpredictable parallel or serial order. Therefore they should be independent of each other.
4. The number of instructions in a kernel is limited.
5. Any thread can stop the kernel by calling assert.
14. Device-level resources:

1. There is a large and slow 8GB global memory, which persists from kernel to kernel.

Transactions to global memory are 128 bytes.

Host memory can also be memory-mapped into global memory, although the I/O rate will be lower.

Reading from global memory can take hundreds of cycles. A warp that does this will be paused and another warp started. Such context switching is very efficient. Therefore device throughput stays high, although there is a latency. This is called Thread Level Parallelism (TLP) and is a major reason for GPU performance.

That assumes that an SM has enough active warps that there is always another warp available for execution. That is a reason for having warps that do not use all the resources (registers etc) that they're allowed to.

2. There is a 2MB L2 cache, for sharing data between SMs.

3. There is a small, fast, 64KiB global constant memory, which also persists from kernel to kernel. It is implemented as a piece of the global memory, made fast with caches.

(Again, I'm still resolving this apparent contradiction).

4. The Grid Management Unit (GMU) schedules (pauses, executes, etc.) grids on the device. This is more important now that grids can start other grids (Dynamic Parallelism).

5. Hyper-Q: 32 simultaneous CPU tasks can launch kernels into the queue; they don't block each other. If one kernel is waiting, another runs.

6. CUDA Work Distributor (CWD) dispatches 32 active grids at a time to the SMs. There may be 1000s of grids queued and waiting.

7. GPU Direct: other devices can DMA data directly to and from the GPU's memory.

8. The base clock is 1607MHz.

9. GFLOPS: 8873.

10. Memory bandwidth: 320GB/s

15. GPU-level resources:

1. Being a Geforce product, there are many graphics facilities that we're not using.
2. There are 4 Graphics processing clusters (GPCs) to do graphics stuff.
3. Several perspective projections can be computed in parallel, for systems with several displays.
4. There's HW for texture processing.
16. Generational changes:

1. With each new version, Nvidia tweaks the numbers. Some get higher, others get lower.
1. E.g., Maxwell had little HW for double precision, and so that was slow.
2. Pascal's clock speed is much higher.
17. Refs:

1. The CUDA program deviceDrv.
2. Better Performance at Lower Occupancy, Vasily Volkov, UC Berkeley, 2010.
3. https://www.pgroup.com/lit/articles/insider/v2n1a5.htm - well written but old.

(I'll keep adding to this. Suggestions are welcome.)

## 6   More CUDA

1. CUDA function qualifiers:

1. __global__ device function called from host, starting a kernel.
2. __device__ device function called from device function.
3. __host__ (default) host function called from host function.
2. CUDA variable qualifiers:

1. __shared__
2. __device__ global
3. __device__ __managed__ automatically paged between host and device.
4. __constant__
5. (nothing) register if scalar, or local if array or if no more registers are available.
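
The qualifiers above can be sketched together in one file (a minimal illustration; all the names are made up, and it assumes blocks of 128 threads):

```cuda
__constant__ float coeff[16];           // constant memory, cached per SM
__device__   float table[256];          // a variable in global memory
__device__ __managed__ int counter;     // automatically paged between host and device

__device__ float square(float x) {      // callable only from device code
    return x * x;
}

__global__ void kern(float *out) {      // a kernel: called from the host
    __shared__ float buf[128];          // per-block shared memory
    int t = threadIdx.x;                // t: a register (scalar)
    float tmp[4];                       // tmp: local memory (an array)
    tmp[0] = square(coeff[t % 16]);
    buf[t] = tmp[0];
    __syncthreads();
    out[t] = buf[127 - t];
}
```
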
3. If installing CUDA on your machine, this repository seems best:

That includes the Thrust headers but not example programs.

## 7   Thrust

1. Thrust is an API that looks like STL. Its backend can be CUDA, OpenMP, or sequential host-based code.

2. The online Thrust directory structure is a mess. Three main sites appear to be these:

1. The best way to install it is to clone from here.
2. The latest version of the examples is also here.
3. The wiki has a lot of doc.
1. https://thrust.github.io/

This points to the above site.

2. https://developer.nvidia.com/thrust

This has links to other Nvidia docs, some of which are obsolete.

3. http://docs.nvidia.com/cuda/thrust/index.html

Easy-to-read and thorough, but obsolete.

4. https://code.google.com/ - no longer exists.

4. Functional-programming philosophy.

5. Many possible backends: host, GPU, OpenMP, TBB...

6. Easier programming, once you get used to it.

7. Code is efficient.

8. Uses some unusual C++ techniques, like overloading operator().

9. Since the Stanford slides were created, Thrust has adopted unified addressing, so that pointers know whether they are host or device.

10. On parallel in /parallel-class/thrust/ are many little demo programs from the thrust distribution, with my additions.

11. CUDACast videos on Thrust:

CUDACast #15 - Introduction to Thrust

CUDACast #16 - Thrust Algorithms and Custom Operators

12. Thrust is fast because functions that look like they would need linear time really take only logarithmic time when run in parallel.

13. In functions like reduce and transform, you often see an argument like thrust::multiplies<float>(). The syntax is as follows:

1. thrust::multiplies<float> is a class.
2. It overloads operator().
3. However, in the call to reduce, thrust::multiplies<float>() is calling the default constructor to construct a variable of class thrust::multiplies<float>, and passing it to reduce.
4. reduce will treat its argument as a function name and call it with an argument, triggering operator().
5. You may also create your own variable of that class, e.g., thrust::multiplies<float> foo. Then you just say foo in the argument list, not foo().
6. The optimizing compiler will replace the operator() function call with the defining expression and then continue optimizing. So, there is no overhead, unlike if you passed in a pointer to a function.
14. Sometimes, e.g., in saxpy.cu, you see saxpy_functor(A).

1. The class saxpy_functor has a constructor taking one argument.
2. saxpy_functor(A) constructs and returns a variable of class saxpy_functor and stores A in the variable.
3. The class also overloads operator().
4. (Let's call the new variable foo.) foo(...) calls operator() for foo; its execution uses the stored A.
5. Effectively, we did a closure of saxpy_functor; that is, we bound a property and returned a new, more restricted, variable or class.
15. The Thrust examples teach several non-intuitive paradigms. As I figure them out, I'll describe a few. My descriptions are modified and expanded versions of the comments in the programs. This is not a list of all the useful programs, but only of some where I am adding to their comments.

1. arbitrary_transformation.cu and dot_products_with_zip.cu show the very useful zip_iterator. Using it is a 2-step process.

1. Combine the separate iterators into a tuple.
2. Construct a zip iterator from the tuple.

Note that operator() is now a template.

2. boundingbox.cu finds the bounding box around a set of 2D points.

The main idea is to do a reduce. However, the combining operation, instead of addition, is to combine two bounding boxes to find the box around them.

3. bucket_sort2d.cu overlays a grid on a set of 2D points and finds the points in each grid cell (bucket).

1. The tuple is an efficient class for a short vector of fixed length.

2. Note how random numbers are generated. You combine an engine that produces random output with a distribution.

However you might need more complicated coding to make the numbers good when executing in parallel. See monte_carlo_disjoint_sequences.cu.

3. The problem is that the number of points in each cell is unpredictable.

4. The cell containing each point is computed, and then the cell indices and the points are sorted together to bring together the points in each cell.

5. Then lower_bound and upper_bound are used to find each bucket in that sorted vector of points.

6. See this lower_bound description.

4. mode.cu shows:

1. Counting the number of unique keys in a vector.
1. Sort the vector.
2. Do an inner_product. However, instead of the operators being times and plus, they are not-equal-to-the-next-element and plus.
2. Counting their multiplicity.
1. Construct vectors, sized at the number of unique keys, to hold the unique keys and counts.
2. Do a reduce_by_keys on a constant_iterator using the sorted vector as the keys. For each range of identical keys, it sums the constant_iterator. That is, it counts the number of identical keys.
3. Write a vector of unique keys and a vector of the counts.
3. Finding the most used key (the mode).
1. Do max_element on the counts vector.
5. repeated_range.cu repeats each element of an N-vector K times: repeated_range([0, 1, 2, 3], 2) -> [0, 0, 1, 1, 2, 2, 3, 3]. It's a lite version of expand.cu, but uses a different technique.

1. Here, N=4 and K=2.
2. The idea is to construct a new iterator, repeated_range, that, when read and incremented, will return the proper output elements.
3. The construction stores the relevant info in structure components of the variable.
4. Treating its value like a subscript in the range [0,N*K), it divides that value by K and returns that element of its input.