
PAR Class 15, Thurs 2021-03-18

1 Student talks

1.1 Mon

  1. Connor. Cloud-based/Docker-based parallel computing.

  2. Joseph. OpenCL.

  3. Blaine. Kubernetes.

  4. Dan. Nsight.

  5. Junshi. Nvidia debugging.

1.2 either

  1. Ben. Docker.

1.3 Did not reply

  1. You know who you are.

2 Nvidia GPU and accelerated computing, 9

This is from https://developer.nvidia.com/teaching-kits-downloads . The material accompanies Programming Massively Parallel Processors: A Hands-on Approach, Third Edition, by David B. Kirk and Wen-mei W. Hwu. I recommend it. (The slides etc. are free, but the book isn't.)

My local copy of what I'm using is in /parclass/2021/files/nvidia/GPU-Teaching-Kit-Accelerated-Computing.zip

Today we'll start from Module_20_Related_Programming_Models_OpenCL/

3 OpenCL

  1. Module 20.

  2. Apple's competition to CUDA.

  3. It is largely CUDA with the names changed, an API styled after OpenGL, and more clunkiness.

  4. Not very interesting so long as Nvidia is dominant.

4 Nvidia primary documentation

Generally more up-to-date and accurate, but drier. A little disorganized because it keeps growing. The root is here: https://docs.nvidia.com/

Two major relevant sets are:

  1. https://docs.nvidia.com/hpc-sdk/index.html

  2. https://docs.nvidia.com/cuda/index.html

5 Other Nvidia features

We've seen almost everything, except:

  1. Texture and surface maps.

  2. ML HW, like computing A=BC+D for 4x4 matrices (tensor cores); see the sketch after this list.

  3. Ray tracing HW, to compute a ray's intersections with boxes.

  4. Cooperative groups: with Ampere, subsets of a warp can synchronize.

  5. Subsets of a GPU can be defined as virtual GPUS, which are walled off from each other.

  6. Memory can be compressed when stored, making a space-time tradeoff.

  7. The terminology "CUDA core" is obsolete. Now they say that an SM has, perhaps, 32 single-precision float units, 32 integer units, 32 CUDA instruction dispatchers, 16 double-precision float units, etc. Each unit operates independently.
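Here is a minimal sketch of how that matrix HW is exposed in CUDA, via the WMMA API (capability 7.0 or later). The API works on 16x16 tiles even though the HW internally multiplies 4x4 pieces. The kernel name and arguments are invented for illustration; one full warp executes it cooperatively.

  #include <cuda_fp16.h>
  #include <mma.h>
  using namespace nvcuda;

  // One warp computes C = A*B for a single 16x16 tile on the tensor cores.
  __global__ void tile_mma(const half *A, const half *B, float *C) {
      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
      wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
      wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;
      wmma::fill_fragment(c, 0.0f);      // for A*B+D, load D into c instead
      wmma::load_matrix_sync(a, A, 16);  // 16 = leading dimension
      wmma::load_matrix_sync(b, B, 16);
      wmma::mma_sync(c, a, b, c);        // the tensor-core multiply-accumulate
      wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
  }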

6 Nvidia conceptual hierarchy

As always, this is as I understand it, and could be wrong. Nvidia uses their own terminology inconsistently. They may use one name for two things (e.g., Tesla and GPU), and may use two names for one thing (e.g., module and accelerator). As time progresses, they change their terminology.

  1. At the bottom is the hardware micro-architecture. This is an API that defines things like the available operations. The last several Nvidia micro-architecture generations are, in order, Tesla (which introduced unified shaders), Fermi, Kepler, Maxwell (introduced in 2014), Pascal (2016), Volta (2017), Turing (2018), and Ampere (2020).

  2. Each micro-architecture is implemented in several different microprocessors. E.g., the Kepler micro-architecture is embodied in the GK107, GK110, etc. Pascal is GP104 etc. The second letter describes the micro-architecture. Different microprocessors with the same micro-architecture may have different amounts of various resources, like the number of processors and clock rate.

  3. To be used, microprocessors are embedded in graphics cards, aka modules or accelerators, which are grouped into series such as GeForce, Quadro, etc. Confusingly, there is a Tesla computing module that may use any of the Tesla, Fermi, or Kepler micro-architectures. Two different modules using the same microprocessor may have different amounts of memory and other resources. These are the components that you buy and insert into a computer. A typical name is GeForce GTX1080.

  4. There are many slightly different accelerators with the same architecture, but different clock speeds and memory, e.g. 1080, 1070, 1060, ...

  5. The same accelerator may be manufactured by different vendors, as well as by Nvidia. These different versions may have slightly different parameters. Nvidia's reference version may be relatively low performance.

  6. The term GPU sometimes refers to the microprocessor and sometimes to the module.

  7. There are at least four families of modules: GeForce for gamers, Quadro for professionals, Tesla for computation, and Tegra for mobility.

  8. Nvidia uses the term Tesla in two unrelated ways. It is an obsolete architecture generation and a module family.

  9. Geoxeon has a (Maxwell) GeForce GTX Titan and a (Kepler) Tesla K20xm. Parallel has a (Turing) Quadro RTX 8000 and a (Pascal) GeForce GTX 1080. We also have an unused (Kepler) Quadro K5000.

  10. Since the highest-end (Tesla) modules don't have video out, they are also called something like compute modules.

7 GPU range of speeds

Here is an example of the wide range of Nvidia GPU speeds; all times are ±20%.

The Quadro RTX 8000 has 4608 CUDA cores @ 1.77GHz and 48GB of memory. matrixMulCUBLAS runs at 5310 GFlops. The specs claim 16 TFlops. However those numbers understate its capabilities because it also has 576 Tensor cores and 72 ray tracing cores to cast 11G rays/sec.

The GeForce GTX 1080 has 2560 CUDA cores @ 1.73GHz and 8GB of memory. matrixMulCUBLAS runs at 3136 GFlops. However the reported time (0.063 msec) is so small that it may be inaccurate. The quoted speed of the 1080 is about triple that. I'm impressed that the measured performance is so close.

The Quadro K2100M in my Lenovo W540 laptop has 576 CUDA cores @ 0.67 GHz and 2GB of memory. matrixMulCUBLAS runs at 320 GFlops. The time on the GPU was about .7 msec, and on the CPU 600 msec.

It's nice that the performance almost scaled with the number of cores and clock speed.
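As a rough sanity check, peak single-precision throughput is about cores × clock × 2 (one fused multiply-add per core per cycle): \(4608 \times 1.77 \times 2 \approx 16300\) GFlops for the RTX 8000 (matching the claimed 16 TFlops), \(2560 \times 1.73 \times 2 \approx 8860\) GFlops for the GTX 1080, and \(576 \times 0.67 \times 2 \approx 770\) GFlops for the K2100M. So matrixMulCUBLAS reaches roughly a third to 40% of peak on each device.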

8 CUDA

8.1 Versions

  1. CUDA has a capability version, whose major number corresponds to the micro-architecture generation. Kepler is 3.x. The K20xm is 3.5. The GTX 1080 is 6.1. The RTX 8000 is 7.5. Here is a table of the properties of different compute capabilities. However, that table is not completely consistent with what deviceQuery shows, e.g., the shared memory size. (You can also query the capability at run time; see the sketch after this list.)

  2. nvcc, the CUDA compiler, can be told which capabilities (aka architectures) to compile for. They can be given as a real architecture, e.g., sm_61, or a virtual architecture, e.g., compute_61.

  3. The CUDA driver and runtime also have a software version, defining things like available C++ functions. The latest is 10.1. This is unrelated to the capability version.
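A minimal sketch of querying the capability at run time, which is essentially what deviceQuery does (error checking omitted; compile with something like nvcc -arch=sm_61 query.cu):

  #include <cstdio>
  #include <cuda_runtime.h>

  int main() {
      int n = 0;
      cudaGetDeviceCount(&n);
      for (int d = 0; d < n; ++d) {
          cudaDeviceProp p;
          cudaGetDeviceProperties(&p, d);
          // p.major.p.minor is the compute capability, e.g., 6.1 on the GTX 1080.
          printf("device %d: %s, capability %d.%d, %zu bytes shared memory per block\n",
                 d, p.name, p.major, p.minor, p.sharedMemPerBlock);
      }
      return 0;
  }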

8.2 Misc

  1. With CUDA, the dominant problem in program optimization is optimizing the data flow. Getting the data quickly to the cores is harder than processing it. It helps a lot to have regular arrays, where successive cores read or write successive entries; see the sketch after this list.

    This is analogous to the hardware fact that wires are bigger (hence, more expensive) than gates.

  2. That is the opposite of the optimization for OpenMP, where different threads writing to adjacent addresses cause the false-sharing problem.

  3. Nvidia CUDA FAQ

    1. has links to other Nvidia docs.

    2. can be a little old.
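Here is the sketch promised above, contrasting the two access patterns. In the first kernel, the 32 threads of a warp touch 32 adjacent words on each iteration, so the loads coalesce into a few 128-byte transactions. In the second, each thread walks its own private chunk, so one warp's 32 simultaneous loads hit 32 different cache lines. The kernel names are invented for illustration.

  // Coalesced: thread i handles elements i, i+nthreads, i+2*nthreads, ... (grid-stride loop).
  __global__ void scale_coalesced(float *a, int n, float s) {
      for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
          a[i] *= s;
  }

  // Uncoalesced: each thread owns a contiguous chunk, so a warp's 32 simultaneous
  // reads are spread over 32 different cache lines.
  __global__ void scale_chunked(float *a, int n, float s) {
      int nthreads = blockDim.x * gridDim.x;
      int per = (n + nthreads - 1) / nthreads;             // elements per thread
      int start = (blockIdx.x * blockDim.x + threadIdx.x) * per;
      for (int i = start; i < start + per && i < n; ++i)
          a[i] *= s;
  }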

9 Types of memory allocation

Here's a brief overview of my understanding of the various places where you can allocate memory in a program.

  1. Static. Define a fixed-size global array. The variable is constructed at compile time, so accesses might perhaps be faster. Global vars with non-default initial values increase the executable file size. If they're large enough, you need to use the compiler option -mcmodel=medium or -mcmodel=large, which causes the compiler to generate wider addresses. I don't know the effect on the program's size or speed, but suspect that it's small.

  2. Stack. Define local arrays that are created and freed as the routine is entered and exited. Their addresses relative to the base of this call frame may be constant. The default stack size is 8MB. You can increase this with the command ulimit or in the program as shown in stacksize.cc. I believe that in OpenMP, the max stack size may be allocated when each thread is created. Then a really big stack size might have a penalty.

  3. Heap. You use new and delete. Variables are constructed whenever you want. The more objects on the heap, the more time that each new or delete takes. If you have lots of objects, consider using placement new or creating an array of them.

    For CUDA, some variables must be on the heap.

I like to use static, then stack, and heap only when necessary. However, allocating a few large blocks on the heap is also fast.
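A small sketch of the three kinds of allocation, plus the CUDA case (sizes and names are arbitrary):

  #include <cuda_runtime.h>

  float big[1000000000];       // static: 4GB of zero-initialized data;
                               // an array this size needs -mcmodel=medium or -mcmodel=large
  void work() {
      float local[1000];       // stack: freed on return; counts against the 8MB default limit
      float *heap = new float[1000000000];            // heap: size chosen at run time
      float *dev;
      cudaMalloc(&dev, 1000000000 * sizeof(float));   // CUDA device memory is heap-style too
      // ... use them ...
      cudaFree(dev);
      delete[] heap;
      (void)local; (void)big;
  }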

Google's allocator is noticeably better than the default one. To use it, link your programs with -ltcmalloc. You can often use it on an existing executable foo thus:

LD_PRELOAD="/usr/lib/libtcmalloc.so" foo

I found it to save 15% to 30% in time.

Another memory concern is speed. Parallel has a NUMA (Non-Uniform Memory Architecture). It has two 14-core Xeons; each processor (socket) has 128GB of main memory. Although all 256GB are in a common address space, accessing memory on the same socket that the thread is running on is faster.

The following is what I think based on some research, but may be wrong: a 4KB page of memory is assigned to a specific socket when it is first written (not when it is reserved). So, each page of a large array may be on a different socket. This can be used to optimize things. This gets more fun with 8-processor systems.
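Here is a sketch of exploiting that first-touch behavior with OpenMP, assuming the placement rule above is right (compile with -fopenmp). Initializing the array with the same parallel loop and schedule that later processes it puts each page near the thread that will use it; initializing it serially would instead put every page on the initializing thread's socket.

  #include <omp.h>

  int main() {
      const long n = 1000000000L;          // ~8GB of doubles
      double *a = new double[n];           // address space reserved; no pages placed yet
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; ++i)
          a[i] = 0.0;                      // first write: the page lands on this thread's socket
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; ++i)         // same schedule, so most accesses are local
          a[i] = 2.0 * a[i] + 1.0;
      delete[] a;
      return 0;
  }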

All that is separate from cache issues.

You can also assign your OpenMP threads to specific cores. This affects speed in ways I don't understand. The issues are resource sharing vs conflicts.

10 Nvidia GPU summary

Here's a summary of the Nvidia Pascal GP104 GPU architecture as I understand it. It's more compact than I've found elsewhere. I'll add to it from time to time. Some numbers are probably wrong.

  1. The host is the CPU.

  2. The device is the GPU.

  3. The device contains 20 streaming multiprocessors (SMs).

    Different GPU generations have used the terms SMX or SMM.

  4. A thread is a sequential program with private and shared memory, program counter, etc.

  5. Threads are grouped, 32 at a time, into warps.

  6. Warps of threads are grouped into blocks.

    Often the warps are only implicit, and we consider that the threads are grouped directly into blocks.

    That abstraction hides details that may be important; see below.

  7. Blocks of threads are grouped into a grid, which is all the threads in the kernel.

  8. A kernel is a parallel program executing on the device.

    1. The kernel runs potentially thousands of threads.

    2. A kernel can create other kernels and wait for their completion.

    3. There may be a limit, e.g., 5 seconds, on a kernel's run time.

  9. Thread-level resources:

    1. Each thread can use up to 255 fast registers. Registers are private to the thread.

      All the threads in one block have their registers allocated from a fixed pool of 65536 registers. The more registers that each thread uses, the fewer warps in the block can run simultaneously.

    2. Each thread has 512KB slow local memory, allocated from the global memory.

    3. Local memory is used when not enough registers are available, and to store thread-local arrays.

  10. Warp-level resources:

    1. Threads are grouped, 32 at a time, into warps.

    2. Each warp executes as a SIMD, with one instruction register. At each cycle, every thread in a warp is either executing the same instruction, or is disabled. If the 32 threads want to execute 32 different instructions, then they will execute one after the other, sequentially.

      If you read in some NVidia doc that threads in a warp run independently, then continue reading the next page to get the info mentioned in the previous paragraph.

    3. If successive instructions in a warp do not depend on each other, then, if there are enough warp schedulers available, they may be executed in parallel. This is called Instruction Level Parallelism (ILP).

    4. For an array in local memory, which means that each thread will have its private copy, the elements for all the threads in a warp are interleaved to potentially increase the I/O rate.

      Therefore your program should try to have successive threads read successive words of arrays.

    5. A thread can read variables from other threads in the same warp, with the shuffle instruction. Typical operations are to read from the K-th next thread, to do a butterfly permutation, or to do an indexed read. This happens in parallel for the whole warp, and does not use shared memory. (The reduction sketch at the end of this section uses shuffles.)

    6. A warp vote combines a bit computed by each thread to report results like all or any.

  11. Block-level resources:

    1. A block may contain up to 1024 threads.

    2. Each block has access to 65536 fast 32-bit registers, for the use of its threads.

    3. Each block can use up to 49152 bytes of the SM's fast shared memory. The block's shared memory is shared by all the threads in the block, but is hidden from other blocks.

      Shared memory is basically a user-controllable cache of some global data. The saving comes from reusing that shared data several times after you loaded it from global memory once.

      Shared memory is interleaved in banks so that some access patterns are faster than others.

    4. Warps in a block run asynchronously and run different instructions. They are scheduled and executed as resources are available.

    5. However, they are all running the same instruction sequence, perhaps at different points in it.

    6. That is called SPMD, single program multiple data.

    7. The threads in a block can be synchronized with __syncthreads(); the reduction sketch at the end of this section uses it.

      Because of how warps are scheduled, that can be slow.

    8. The threads in a block can be arranged into a 3D array, up to 1024x1024x64.

      That is for convenience, and does not increase performance (I think).

    9. I'll talk about textures later.

  12. Streaming Multiprocessor (SM) - level resources:

    1. Each SM has 128 single-precision CUDA cores, 4 double-precision units, 32 special function units, and 32 load/store units.

    2. In total, the GPU has 2560 CUDA cores.

    3. A CUDA core is akin to an ALU. The cores, and all the units, are pipelined.

    4. A CUDA core is much less powerful than one core of an Intel Xeon. My guess is 1/20th.

    5. Beware that, in the CUDA C Programming Guide, NVidia sometimes calls an SM a core.

    6. The limited number of, e.g., double-precision units means that a DP instruction will need to be scheduled several times for all the threads to execute it. That's why DP is slower.

    7. Each SM has 4 warp schedulers and 8 instruction dispatch units.

    8. 64 warps can simultaneously reside in an SM.

    9. Therefore up to 32x64=2048 threads can be simultaneously resident in an SM.

    10. Up to 16 blocks can simultaneously be resident in an SM.

      However, if each block uses too many resources, like shared memory, then this number is reduced.

      Each block sits on only one SM; no block is split. However a block's warps are executed asynchronously (until synced).

    11. Each SM has 64KiB (?) fast memory to be divided between shared memory and an L1 cache. Typically, 48KiB (96?) is used for the shared memory, to be divided among its resident blocks, but that can be changed.

    12. The 48KB L1 cache can cache local or global memory.

    13. Each SM has a read-only data cache of 48KB to cache the global constant memory.

    14. Each SM has 8 texture units, and many other graphics capabilities.

    15. The 2MB L2 cache (see device-level resources below) is physically divided among the memory controllers, about 256KB each; it is shared by all SMs rather than private to one.

  13. Grid-level resources:

    1. The blocks in a grid can be arranged into a 3D array, up to \((2^{31}-1, 2^{16}-1, 2^{16}-1)\).

    2. Blocks in a grid might run on different SMs.

    3. Blocks in a grid are queued and executed as resources are available, in an unpredictable parallel or serial order. Therefore they should be independent of each other.

    4. The number of instructions in a kernel is limited.

    5. Any thread can stop the kernel by calling assert.

  14. Device-level resources:

    1. There is a large and slow global memory (8GB on the GTX 1080; 48GB on the RTX 8000), which persists from kernel to kernel.

      Transactions to global memory are 128 bytes.

      Host memory can also be memory-mapped into global memory, although the I/O rate will be lower.

      Reading from global memory can take hundreds of cycles. A warp that does this will be paused and another warp started. Such context switching is very efficient. Therefore device throughput stays high, although there is a latency. This is called Thread Level Parallelism (TLP) and is a major reason for GPU performance.

      That assumes that an SM has enough active warps that there is always another warp available for execution. That is a reason for having warps that do not use all the resources (registers etc) that they're allowed to.

    2. There is a 2MB L2 cache, for sharing data between SMs.

    3. There is a small, fast, 64KiB global constant memory, which also persists from kernel to kernel. It is implemented as a piece of the global memory, made fast with caches.

      (Again, I'm still resolving this apparent contradiction).

    4. Grid Management Unit (GMU) schedules (pauses, executes, etc) grids on the device. This is more important because grids can start other grids (Dynamic Parallelism).

    5. Hyper-Q: 32 simultaneous CPU tasks can launch kernels into the queue; they don't block each other. If one kernel is waiting, another runs.

    6. CUDA Work Distributor (CWD) dispatches 32 active grids at a time to the SMs. There may be 1000s of grids queued and waiting.

    7. GPU Direct: Other devices can DMA the GPU memory.

    8. The base clock is 1607MHz.

    9. GFLOPS: 8873.

    10. Memory bandwidth: 320GB/s

  15. GPU-level resources:

    1. Since this is a GeForce product, there are many graphics facilities that we're not using.

    2. There are 4 Graphics processing clusters (GPCs) to do graphics stuff.

    3. Several perspective projections can be computed in parallel, for systems with several displays.

    4. There's HW for texture processing.

  16. Generational changes:

    1. With each new version, Nvidia tweaks the numbers. Some get higher, others get lower.

      1. E.g., Maxwell had little HW for double precision, and so that was slow.

      2. Pascal's clock speed is much higher.

  17. Refs:

    1. The CUDA program deviceDrv.

    2. http://developer.download.nvidia.com/compute/cuda/compute-docs/cuda-performance-report.pdf

    3. http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf

    4. Better Performance at Lower Occupancy, Vasily Volkov, UC Berkeley, 2010.

    5. https://www.pgroup.com/lit/articles/insider/v2n1a5.htm - well written but old.

    (I'll keep adding to this. Suggestions are welcome.)
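Here is the reduction sketch referred to in the warp- and block-level items above. It ties several resources together: coalesced reads from global memory, shared memory as a per-block scratchpad, __syncthreads() between a block's warps, a warp shuffle for the last 32 values, and one atomic per block. The names and launch configuration are illustrative; *out must be zeroed before the launch.

  __global__ void block_sum(const float *a, float *out, int n) {
      __shared__ float s[256];             // fast per-block shared memory (assumes blockDim.x == 256)
      int tid = threadIdx.x;
      float v = 0.0f;
      // Grid-stride loop: successive threads read successive words (coalesced).
      for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
          v += a[i];
      s[tid] = v;
      __syncthreads();                     // wait for every warp in the block
      // Tree reduction in shared memory down to one warp's worth of values.
      for (int stride = blockDim.x / 2; stride >= 32; stride /= 2) {
          if (tid < stride) s[tid] += s[tid + stride];
          __syncthreads();
      }
      if (tid < 32) {                      // the first warp finishes with shuffles; no shared memory needed
          v = s[tid];
          for (int offset = 16; offset > 0; offset /= 2)
              v += __shfl_down_sync(0xffffffff, v, offset);
          if (tid == 0) atomicAdd(out, v); // one atomic add per block
      }
  }

  // Hypothetical launch; 3D block and grid shapes are allowed, but 1D is enough here:
  //   block_sum<<<128, 256>>>(d_a, d_out, n);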

11 More CUDA

  1. CUDA function qualifiers (a sketch using these and the variable qualifiers follows at the end of this section):

    1. __global__ device function called from host, starting a kernel.

    2. __device__ device function called from device function.

    3. __host__ (default) host function called from host function.

  2. CUDA variable qualifiers:

    1. __shared__

    2. __device__ global

    3. __device__ __managed__ automatically paged between host and device.

    4. __constant__

    5. (nothing) register if scalar, or local if array or if no more registers available.

  3. If installing CUDA on your machine, this repository seems best:

    http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64

    That includes the Thrust headers but not example programs.
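Here is the qualifier sketch mentioned above; everything in it is invented for illustration.

  #include <cuda_runtime.h>

  __constant__ float coeff[16];               // cached, read-only from device code; set with cudaMemcpyToSymbol
  __device__ float scale = 2.0f;              // global variable in device memory
  __device__ __managed__ float result = 0.0f; // automatically paged between host and device

  __device__ float f(float x) {               // device function, callable only from device code
      return scale * x + coeff[0];
  }

  __global__ void kernel(const float *a, float *b, int n) {  // called from the host; starts a kernel
      __shared__ float cache[256];            // per-block shared memory
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      int t = threadIdx.x;                    // plain locals like i and t live in registers
      cache[t] = (i < n) ? f(a[i]) : 0.0f;
      __syncthreads();
      if (i < n) b[i] = cache[t];
  }

  int host_helper() { return 0; }             // __host__ is the default, so it's usually omitted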

12 Unionfs: Linux trick of the day

  1. aka overlay FS, translucent FS.

  2. If a, b are directories, and m is an empty directory, then

    unionfs -o cow a=RW:b m

    makes m a combination of a and b, with a having higher priority.

  3. Writing a file into m writes it in a.

  4. Changing a file in b writes the new version into a.

  5. Deleting a file in b causes a white-out note to be stored in a.

  6. Unmount it thus:

    fusermount -u m

  7. None of this requires superuser.

  8. Application: making a read-only directory into a read-write directory.

  9. Note: IBM had a commercial version of this idea in its CP/CMS OS in the 1960s.

13 Thrust

  1. Stanford's parallel course notes. We got up through lecture 6 and parts of 7 and 8.
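For flavor, a minimal Thrust sketch of the kind of thing those notes build up to: fill a vector on the host, copy it to the device, then sort and sum it in parallel.

  #include <thrust/host_vector.h>
  #include <thrust/device_vector.h>
  #include <thrust/sort.h>
  #include <thrust/reduce.h>
  #include <cstdlib>
  #include <iostream>

  int main() {
      thrust::host_vector<int> h(1 << 20);
      for (size_t i = 0; i < h.size(); ++i) h[i] = rand() % 1000;
      thrust::device_vector<int> d = h;                         // host-to-device copy
      thrust::sort(d.begin(), d.end());                         // parallel sort on the GPU
      long long sum = thrust::reduce(d.begin(), d.end(), 0LL);  // parallel sum
      std::cout << "sum = " << sum << "\n";
      return 0;
  }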