
Quantum Class 19, Thu 2022-11-17

1 ECSE Alumni

It is possible to succeed after RPI.

1.1 Tony Tether

Tony Tether was, IIRC, the longest-serving DARPA Director. He initiated the 2004 DARPA Grand Challenge that spurred the development of autonomous vehicles.

1.2 Curtis Priem

See below.

1.3 Ed Zander

COO Sun, CEO Motorola.

https://en.wikipedia.org/wiki/Edward_Zander

2 General programming tips

2.1 Unionfs

  1. aka overlay FS, translucent FS.

  2. If a, b are directories, and m is an empty directory, then

    unionfs -o cow a=RW:b m

    makes m a combined view of a and b, with a taking priority.

  3. Writing a file into m writes it in a.

  4. Changing a file in b writes the new version into a (copy on write).

  5. Deleting a file in b causes a white-out note to be stored in a.

  6. Unmount it thus:

    fusermount -u m

  7. None of this requires superuser.

  8. Application: making a read-only directory appear to be read-write.

  9. Note: IBM had a commercial version of this idea in its CP/CMS OS in the 1960s.

2.2 Types of virtualization

  1. There are many possible levels of virtualization.

  2. At a low level, one might emulate the HW. This is quite flexible but too slow.

  3. At a higher level, a basic OS runs separate virtual machines, each with its own file system.

    1. Harmless machine instructions execute normally.

    2. Powerful ones are trapped and emulated.

    3. This requires a properly designed instruction set.

    4. IBM has been doing this commercially since the 1960s, with something originally called CP/CMS.

      I think that IBM lucked out with their instruction set design, and didn't plan it.

    5. Well-behaved clients might have problematic code edited before running, to speed the execution.

      I think that VMware does that.

    6. It seems that compute-intensive clients might have almost no overhead.

    7. However, the emulated file system can be pretty slow.

    8. With VMware, several clients can all be different OSs, and the host can be any compatible OS.

    9. E.g., I've had a Linux VMware host simultaneously running both Linux and Windows clients.

  4. In Linux, root no longer has infinite power.

  5. The next level of virtualization has a nontrivial host OS, but separates the clients from each other.

    1. They see a private view of the process space, file system, and other resources.

    2. This is lighter weight, e.g., quicker to start a VM and less overhead.

    3. The host and client must be the same OS.

    4. This might be called paravirtualization.

    5. Linux supports this with things like namespace isolation and control groups (cgroups); Wikipedia et al. describe this.

    6. Ubuntu snaps do something like this.

      E.g., Firefox is distributed as a snap to increase isolation and security.

      However, starting Firefox now takes about 15 seconds.

  6. The next level up is the normal linux security.

    1. You can see all the processes and similar resources.

    2. The file system has the usual protections.

    3. This is hard to make secure when doing something complicated.

    4. How do I protect myself from Firefox going bad?

    5. It's easy to describe what it should be allowed to do, but almost impossible to implement.

    6. That includes using AppArmor etc.

  7. Who guards the guards? I get spammed at a unique address that I used only to register with AppArmor.

  8. In theory, packaging an app in a virtual machine has fewer dependencies and is more secure.

  9. You can run the VM without changes on different hosts.

  10. A VMware client can run without change on both Linux and Windows hosts.

  11. You can run a client on your own hardware, then spill over to a commercial cloud platform when necessary.

2.3 Docker

  1. Docker is a popular lightweight virtualization system, which Nvidia uses to distribute SW.

  2. Docker runs images that define virtual machines.

  3. Docker images share resources with the host, in a controlled manner.

  4. For simple images (nvidia/cuda is not one of them), starting the image is so cheap that you can do it just to run one command, and encapsulate the whole process in a shell function.

  5. Docker is worth learning, apart from its use by Nvidia for parallel computing. You might also look up Kubernetes.

  6. More info:

    1. https://www.docker.com/

    2. https://www.zdnet.com/article/what-is-docker-and-why-is-it-so-darn-popular/

    3. https://opensource.com/resources/what-docker

2.4 Several forms of C++ functions

  1. Traditional top level function

    auto add(int a, int b) { return a+b;}

    You can pass this to a function. This really passes a pointer to the function. It doesn't optimize across the call.

    These have global scope.

  2. Note auto. It's underused.

  3. Overload operator() in a new class

    Each object of the class is a different function, and the function can use the object's member values; this makes it a closure.

    This is local to the containing block.

    This form optimizes well. (A combined sketch of forms 3-5 appears after this list.)

  4. Lambda, or anonymous function.

    auto add = [](int a, int b) { return a+b;};

    This is local to the containing block.

    This form optimizes well.

  5. Placeholder notation.

    As an argument in, e.g., thrust transform, you can do this:

    transform(..., _1+_2);

    This is nice and short.

    As this is implemented by overloading the operators, the syntax of the expression is limited to what was overloaded.
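
Here is a minimal sketch pulling forms 3-5 together with Thrust. It assumes a working CUDA/Thrust install and compiles with nvcc; the functor name AddK is made up for illustration.

    // functor_lambda_placeholder.cu -- compile with, e.g.: nvcc -std=c++14 functor_lambda_placeholder.cu
    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/functional.h>
    #include <iostream>

    // Form 3: a functor. Each object carries its own k, so it acts as a closure.
    struct AddK {
        int k;
        __host__ __device__ int operator()(int a) const { return a + k; }
    };

    int main() {
        thrust::device_vector<int> a(4, 1), b(4, 2), c(4);

        // Form 4: a lambda (host-side here; device lambdas need nvcc --extended-lambda).
        auto add = [](int x, int y) { return x + y; };
        std::cout << "add(2,3) = " << add(2, 3) << "\n";

        // Form 3 in use: transform each element of a with a functor object.
        thrust::transform(a.begin(), a.end(), c.begin(), AddK{10});

        // Form 5: placeholder notation; _1 + _2 builds the functor for you.
        using namespace thrust::placeholders;
        thrust::transform(a.begin(), a.end(), b.begin(), c.begin(), _1 + _2);

        std::cout << "c[0] = " << c[0] << "\n";   // 1 + 2 = 3
        return 0;
    }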

3 Parallel computing summary

  1. Read https://computing.llnl.gov/tutorials/parallel_comp/ for an intro to parallel computing.

  2. Some points:

    1. Parallel computing is decades old; there were commercial machines in the 1980s. I directed two PhD theses in parallel geometry then. However, at that time clock speeds were still increasing, so serial machines were more interesting.

    2. Now: there are physics limits to processor speed.

    3. History of Nvidia.

      1. Curtis R. Priem ’82, Secretary of the RPI Board, had designed graphics HW for both IBM and Sun Microsystems.

      2. Priem was an undergrad in ECSE.

      3. For a while, Sun was THE Unix workstation company. They used open standards and had the best price/performance.

      4. Nvidia designed gaming graphics accelerators...

      5. that just happened to be parallel coprocessors...

      6. that started to be used for nongraphics parallel processing because of their value.

      7. Nvidia noticed that and added more capability, e.g., double precision IEEE floats, to serve that market.

      8. Currently, some of the highest performance Nvidia boards cannot even do graphics because they don't have video out ports.

    4. Intel CPUs vs Nvidia CUDA cores.

    5. Advantages and disadvantages of shared memory.

    6. OpenMP vs CUDA.

    7. The rate-limiting cost is usually I/O, not computation.

  3. NVIDIA's teaching material.

    Note: Rather than explicitly extracting large zip archives or tarballs in order to read files in them, I use archivemount to create a virtual file system. This saves disk space, and it doesn't stress git as much (fewer files). When reading, the I/O time is insignificantly increased. For some formats, you can even write. You can also have more confidence that a single zip file wasn't changed than you can for a directory with perhaps hundreds of files.

4 NVIDIA

4.1 NVIDIA GTC conference

  1. https://www.nvidia.com/gtc/

  2. mostly free.

4.2 Top500

Many of the top 500 supercomputers contain NVIDIA modules, often together with IBM Power Systems.

4.3 Nvidia primary documentation

Generally more up-to-date and accurate, but drier than the secondary docs. A little disorganized because it keeps growing. The root is here: https://docs.nvidia.com/

Two major relevant sets are:

  1. https://docs.nvidia.com/hpc-sdk/index.html

  2. https://docs.nvidia.com/cuda/index.html

4.4 Nvidia conceptual hierarchy

As always, this is as I understand it, and could be wrong. Nvidia uses their own terminology inconsistently. They may use one name for two things (E.g., Tesla and GPU), and may use two names for one thing (e.g., module and accelerator). As time progresses, they change their terminology.

  1. At the bottom is the hardware micro-architecture. This is an API that defines things like the available operations. The last several Nvidia micro-architecture generations are, in order, Tesla (which introduced unified shaders), Fermi, Kepler, Maxwell (introduced in 2014), Pascal (2016), Volta (2017), Turing (2018), and Ampere (2020).

  2. Each micro-architecture is implemented in several different microprocessors. E.g., the Kepler micro-architecture is embodied in the GK107, GK110, etc. Pascal is GP104 etc. The second letter describes the micro-architecture. Different microprocessors with the same micro-architecture may have different amounts of various resources, like the number of processors and clock rate.

  3. To be used, microprocessors are embedded in graphics cards, aka modules or accelerators, which are grouped into series such as GeForce, Quadro, etc. Confusingly, there is a Tesla computing module that may use any of the Tesla, Fermi, or Kepler micro-architectures. Two different modules using the same microprocessor may have different amounts of memory and other resources. These are the components that you buy and insert into a computer. A typical name is GeForce GTX1080.

  4. There are many slightly different accelerators with the same architecture, but different clock speeds and memory, e.g. 1080, 1070, 1060, ...

  5. The same accelerator may be manufactured by different vendors, as well as by Nvidia. These different versions may have slightly different parameters. Nvidia's reference version may be relatively low performance.

  6. The term GPU sometimes refers to the microprocessor and sometimes to the module.

  7. There are at least four families of modules: GeForce for gamers, Quadro for professionals, Tesla for computation, and Tegra for mobility.

  8. Nvidia uses the term Tesla in two unrelated ways. It is an obsolete architecture generation and a module family.

  9. Geoxeon has a (Maxwell) GeForce GTX Titan and a (Kepler) Tesla K20xm. Parallel has a (Turing) Quadro RTX 8000 and a (Pascal) GeForce GTX 1080. We also have an unused (Kepler) Quadro K5000.

  10. Since the highest-end (Tesla) modules don't have video out, they are also called something like compute modules.

4.5 GPU range of speeds

Here is an example of the wide range of Nvidia GPU speeds; all times are ±20%.

The Quadro RTX 8000 has 4608 CUDA cores @ 1.77GHz and 48GB of memory. matrixMulCUBLAS runs at 5310 GFlops. The specs claim 16 TFlops. However those numbers understate its capabilities because it also has 576 Tensor cores and 72 ray tracing cores to cast 11G rays/sec.
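
As a sanity check on the claimed peak (this is just the standard peak-rate formula, 2 FLOPs per fused multiply-add per core per cycle): \(2 \times 4608 \times 1.77\,GHz \approx 16.3\) TFlops, matching the 16 TFlops spec. The measured 5310 GFlops from matrixMulCUBLAS is about a third of that peak.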

The GeForce GTX 1080 has 2560 CUDA cores @ 1.73GHz and 8GB of memory. matrixMulCUBLAS runs at 3136 GFlops. However the reported time (0.063 msec) is so small that it may be inaccurate. The quoted speed of the 1080 is about triple that. I'm impressed that the measured performance is so close.

The Quadro K2100M in my Lenovo W540 laptop has 576 CUDA cores @ 0.67 GHz and 2GB of memory. matrixMulCUBLAS runs at 320 GFlops. The time on the GPU was about 0.7 msec, and on the CPU 600 msec.

It's nice that the performance almost scaled with the number of cores and clock speed.

4.6 CUDA

4.6.1 Versions

  1. CUDA has a capability version, whose major number corresponds to the micro-architecture generation. Kepler is 3.x. The K20xm is 3.5. The GTX 1080 is 6.1. The RTX 8000 is 7.5. Here is a table of the properties of different compute capabilities. However, that table is not completely consistent with what deviceQuery shows, e.g., the shared memory size.

  2. nvcc, the CUDA compiler, can be told which capabilities (aka architectures) to compile for. They can be given as a real architecture, e.g., sm_61, or a virtual architecture, e.g., compute_61.

  3. The CUDA driver and runtime also have a software version, defining things like available C++ functions; as of fall 2022 this is 11.x. This is unrelated to the capability version. (The sketch after this list shows how to query both the capability and the driver/runtime versions.)
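
Here is a minimal sketch, assuming a working CUDA toolkit, of querying these versions at runtime; the major.minor it prints is what you would pass to nvcc as sm_<major><minor>.

    // capability.cu -- a minimal sketch; compile with, e.g.: nvcc capability.cu
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);      // device 0
        int driverVer = 0, runtimeVer = 0;
        cudaDriverGetVersion(&driverVer);       // e.g., 11080 for 11.8
        cudaRuntimeGetVersion(&runtimeVer);
        printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
        printf("driver %d, runtime %d\n", driverVer, runtimeVer);
        return 0;
    }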

4.6.2 Misc

  1. With CUDA, the dominant problem in program optimization is optimizing the data flow. Getting the data quickly to the cores is harder than processing it. It helps greatly to have regular arrays, where each core reads or writes a successive entry; see the sketch after this list.

    This is analogous to the hardware fact that wires are bigger (hence, more expensive) than gates.

  2. That is the opposite optimization to OpenMP, where having different threads writing to adjacent addresses will cause the false sharing problem.

  3. Nvidia CUDA FAQ

    1. has links to other Nvidia docs.

    2. can be a little old.
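
To illustrate point 1, here is a hedged sketch of the coalescing idea; the kernel names are made up, and the second kernel exists only to show the scattered access pattern.

    // coalesce.cu -- a sketch of the access-pattern point in item 1 above.
    #include <cuda_runtime.h>

    // Good: thread i touches element i, so the 32 threads of a warp read 32
    // consecutive words, which the hardware coalesces into few transactions.
    __global__ void coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];
    }

    // Bad: each thread's index is scaled by a stride, so the words touched by a
    // warp in one instruction are scattered and cannot be coalesced.
    __global__ void strided(const float *in, float *out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = 2.0f * in[i];
    }

    int main() {
        const int n = 1 << 20;
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));     // unified memory keeps the sketch short
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = i;
        int threads = 256, blocks = (n + threads - 1) / threads;
        coalesced<<<blocks, threads>>>(in, out, n);
        strided<<<blocks, threads>>>(in, out, n, 32);  // touches only every 32nd element; shown for the pattern
        cudaDeviceSynchronize();
        cudaFree(in); cudaFree(out);
        return 0;
    }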

4.7 Nvidia GPU summary

Here's a summary of the Nvidia Pascal GP104 GPU architecture as I understand it. It's more compact than I've found elsewhere. I'll add to it from time to time. Some numbers are probably wrong.

  1. The host is the CPU.

  2. The device is the GPU.

  3. The device contains 20 streaming multiprocessors (SMs).

    Different GPU generations have used the terms SMX or SMM.

  4. A thread is a sequential program with private and shared memory, program counter, etc.

  5. Threads are grouped, 32 at a time, into warps.

  6. Warps of threads are grouped into blocks.

    Often the warps are only implicit, and we consider that the threads are grouped directly into blocks.

    That abstraction hides details that may be important; see below.

  7. Blocks of threads are grouped into a grid, which is all the threads in the kernel.

  8. A kernel is a parallel program executing on the device.

    1. The kernel runs potentially thousands of threads.

    2. A kernel can create other kernels and wait for their completion.

    3. There may be a limit, e.g., 5 seconds, on a kernel's run time.

  9. Thread-level resources:

    1. Each thread can use up to 255 fast registers. Registers are private to the thread.

      All the threads in one block have their registers allocated from a fixed pool of 65536 registers. The more registers that each thread uses, the fewer warps in the block can run simultaneously.

    2. Each thread has 512KB slow local memory, allocated from the global memory.

    3. Local memory is used when not enough registers are available, and to store thread-local arrays.

  10. Warp-level resources:

    1. Threads are grouped, 32 at a time, into warps.

    2. Each warp executes as a SIMD, with one instruction register. At each cycle, every thread in a warp is either executing the same instruction, or is disabled. If the 32 threads want to execute 32 different instructions, then they will execute one after the other, sequentially.

      If you read in some NVidia doc that threads in a warp run independently, then continue reading the next page to get the info mentioned in the previous paragraph.

    3. If successive instructions in a warp do not depend on each other, then, if there are enough warp schedulers available, they may be executed in parallel. This is called Instruction Level Parallelism (ILP).

    4. For an array in local memory, which means that each thread will have its private copy, the elements for all the threads in a warp are interleaved to potentially increase the I/O rate.

      Therefore your program should try to have successive threads read successive words of arrays.

    5. A thread can read variables from other threads in the same warp, with the shuffle instruction. Typical operations are to read from the K-th next thread, to do a butterfly permutation, or to do an indexed read. This happens in parallel for the whole warp, and does not use shared memory. (The reduction sketch at the end of this section uses this.)

    6. A warp vote combines a bit computed by each thread to report results like all or any.

  11. Block-level resources:

    1. A block may contain up to 1024 threads.

    2. Each block has access to 65536 fast 32-bit registers, for the use of its threads.

    3. Each block can use up to 49152 bytes of the SM's fast shared memory. The block's shared memory is shared by all the threads in the block, but is hidden from other blocks.

      Shared memory is basically a user-controllable cache of some global data. The saving comes from reusing that shared data several times after you loaded it from global memory once.

      Shared memory is interleaved in banks so that some access patterns are faster than others.

    4. Warps in a block run asynchronously and run different instructions. They are scheduled and executed as resources are available.

    5. However they are all running the same instruction sequence, perhaps at different points in it.

    6. That is called SPMD, single program multiple data.

    7. The threads in a block can be synchronized with __syncthreads().

      Because of how warps are scheduled, that can be slow.

    8. The threads in a block can be arranged into a 3D array, up to 1024x1024x64.

      That is for convenience, and does not increase performance (I think).

    9. I'll talk about textures later.

  12. Streaming Multiprocessor (SM) - level resources:

    1. Each SM has 128 single-precision CUDA cores, 64 double-precision units, 32 special function units, and 32 load/store units.

    2. In total, the GPU has 2560 CUDA cores.

    3. A CUDA core is akin to an ALU. The cores, and all the units, are pipelined.

    4. A CUDA core is much less powerful than one core of an Intel Xeon. My guess is 1/20th.

    5. Beware that, in the CUDA C Programming Guide, NVidia sometimes calls an SM a core.

    6. The limited number of, e.g., double precision units means that a DP instruction will need to be scheduled several times for all the threads to execute it. That's why DP is slower.

    7. Each SM has 4 warp schedulers and 8 instruction dispatch units.

    8. 64 warps can simultaneously reside in an SM.

    9. Therefore up to 32x64=2048 threads can be executed in parallel by an SM.

    10. Up to 16 blocks can simultaneously be resident in an SM.

      However, if each block uses too many resources, like shared memory, then this number is reduced.

      Each block sits on only one SM; no block is split. However a block's warps are executed asynchronously (until synced).

    11. Each SM has 64KiB (?) fast memory to be divided between shared memory and an L1 cache. Typically, 48KiB (96?) is used for the shared memory, to be divided among its resident blocks, but that can be changed.

    12. The 48KB L1 cache can cache local or global memory.

    13. Each SM has a read-only data cache of 48KB to cache the global constant memory.

    14. Each SM has 8 texture units, and many other graphics capabilities.

    15. Each SM has 256KB of L2 cache.

  13. Grid-level resources:

    1. The blocks in a grid can be arranged into a 3D array, up to \((2^{31}-1, 2^{16}-1, 2^{16}-1)\).

    2. Blocks in a grid might run on different SMs.

    3. Blocks in a grid are queued and executed as resources are available, in an unpredictable parallel or serial order. Therefore they should be independent of each other.

    4. The number of instructions in a kernel is limited.

    5. Any thread can stop the kernel by calling assert.

  14. Device-level resources:

    1. There is a large and slow 48GB global memory, which persists from kernel to kernel.

      Transactions to global memory are 128 bytes.

      Host memory can also be memory-mapped into global memory, although the I/O rate will be lower.

      Reading from global memory can take hundreds of cycles. A warp that does this will be paused and another warp started. Such context switching is very efficient. Therefore device throughput stays high, although there is a latency. This is called Thread Level Parallelism (TLP) and is a major reason for GPU performance.

      That assumes that an SM has enough active warps that there is always another warp available for execution. That is a reason for having warps that do not use all the resources (registers etc) that they're allowed to.

    2. There is a 2MB L2 cache, for sharing data between SMs.

    3. There is a 64KiB small and fast global constant memory, which also persists from kernel to kernel. It is implemented as a piece of the global memory, made fast with caches.

      (Again, I'm still resolving this apparent contradiction).

    4. Grid Management Unit (GMU) schedules (pauses, executes, etc) grids on the device. This is more important because grids can start other grids (Dynamic Parallelism).

    5. Hyper-Q: 32 simultaneous CPU tasks can launch kernels into the queue; they don't block each other. If one kernel is waiting, another runs.

    6. CUDA Work Distributor (CWD) dispatches 32 active grids at a time to the SMs. There may be 1000s of grids queued and waiting.

    7. GPU Direct: Other devices can DMA the GPU memory.

    8. The base clock is 1607MHz.

    9. GFLOPS: 8873.

    10. Memory bandwidth: 320GB/s

  15. GPU-level resources:

    1. Being a GeForce product, there are many graphics facilities that we're not using.

    2. There are 4 Graphics processing clusters (GPCs) to do graphics stuff.

    3. Several perspective projections can be computed in parallel, for systems with several displays.

    4. There's HW for texture processing.

  16. Generational changes:

    1. With each new version, Nvidia tweaks the numbers. Some get higher, others get lower.

      1. E.g., Maxwell had little HW for double precision, and so that was slow.

      2. Pascal's clock speed is much higher.

  17. Refs:

    1. The CUDA sample program deviceQuery.

    2. http://developer.download.nvidia.com/compute/cuda/compute-docs/cuda-performance-report.pdf

    3. http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf

    4. Better Performance at Lower Occupancy, Vasily Volkov, UC Berkeley, 2010.

    5. https://www.pgroup.com/lit/articles/insider/v2n1a5.htm - well written but old.

    (I'll keep adding to this. Suggestions are welcome.)
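
To tie several of the items above together (grids and blocks, shared memory and __syncthreads() at the block level, and the warp shuffle), here is a minimal reduction sketch, as I understand the API; it is illustrative rather than tuned.

    // reduce.cu -- sums an array; a sketch tying together the grid/block/warp ideas above.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void sumKernel(const float *in, float *result, int n) {
        __shared__ float partial[32];              // block-level fast shared memory (item 11)
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;     // a grid of blocks of threads (item 7)
        float v = (i < n) ? in[i] : 0.0f;

        // Warp-level reduction with shuffles; no shared memory needed (item 10).
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);

        if (tid % 32 == 0) partial[tid / 32] = v;  // one partial sum per warp
        __syncthreads();                           // synchronize the block (item 11)

        // The first warp reduces the per-warp partials (at most 32 of them).
        if (tid < 32) {
            float w = (tid < blockDim.x / 32) ? partial[tid] : 0.0f;
            for (int offset = 16; offset > 0; offset >>= 1)
                w += __shfl_down_sync(0xffffffff, w, offset);
            if (tid == 0) atomicAdd(result, w);    // blocks finish in an unpredictable order (item 13)
        }
    }

    int main() {
        const int n = 1 << 20;
        float *in, *result;
        cudaMallocManaged(&in, n * sizeof(float)); // unified memory, paged between host and device
        cudaMallocManaged(&result, sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = 1.0f;
        *result = 0.0f;

        dim3 block(256);                           // up to 1024 threads per block (item 11)
        dim3 grid((n + block.x - 1) / block.x);    // blocks may land on different SMs (item 13)
        sumKernel<<<grid, block>>>(in, result, n);
        cudaDeviceSynchronize();

        printf("sum = %.0f (expect %d)\n", *result, n);
        cudaFree(in); cudaFree(result);
        return 0;
    }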

4.8 More CUDA

  1. CUDA function qualifiers:

    1. __global__: a device function called from the host, starting a kernel.

    2. __device__: a device function called from a device function.

    3. __host__: (the default) a host function called from a host function.

  2. CUDA variable qualifiers:

    1. __shared__: fast per-block shared memory.

    2. __device__: global memory on the device.

    3. __device__ __managed__: automatically paged between host and device.

    4. __constant__: cached read-only constant memory.

    5. (nothing): a register if scalar, or local memory if an array or if no more registers are available. (The sketch below uses most of these qualifiers.)

  3. If installing CUDA on your machine, this repository seems best:

    http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64

    That includes the Thrust headers but not example programs.
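
Here is a minimal sketch that exercises most of the above qualifiers in one file; names like twice and buf are made up for illustration.

    // qualifiers.cu -- a sketch; compile with, e.g.: nvcc -arch=sm_61 qualifiers.cu
    #include <cstdio>
    #include <cuda_runtime.h>

    __constant__ float scale = 2.0f;           // __constant__: cached, read-only on the device
    __device__ float total;                    // __device__: a variable in device global memory
    __device__ __managed__ int count;          // __managed__: automatically paged between host and device

    __device__ float twice(float x) {          // __device__: called from device code
        return scale * x;
    }

    __global__ void kernel(const float *in, float *out, int n) {   // __global__: called from the host
        __shared__ float buf[256];             // __shared__: fast per-block shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;              // a plain scalar lives in a register
        buf[threadIdx.x] = (i < n) ? twice(in[i]) : 0.0f;
        __syncthreads();
        if (i < n) out[i] = buf[threadIdx.x];
        if (i == 0) { total = out[0]; count = n; }
    }

    int main() {                               // __host__ is the default and is usually omitted
        const int n = 256;
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = i;
        kernel<<<1, 256>>>(in, out, n);
        cudaDeviceSynchronize();
        printf("out[3] = %g, count = %d\n", out[3], count);   // the managed count is readable on the host
        cudaFree(in); cudaFree(out);
        return 0;
    }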

4.9 Other Nvidia features

We've seen almost everything, except:

  1. Texture and surface maps.

  2. ML HW (tensor cores) that computes A=BC+D for small (e.g., 4x4) matrices.

  3. Ray tracing HW, to compute a ray's intersections with boxes.

  4. Cooperative groups: with Ampere, subsets of a warp can synchronize; see the sketch after this list.

  5. Subsets of a GPU can be defined as virtual GPUs, which are walled off from each other.

  6. Memory can be compressed when stored, making a space-time tradeoff.

  7. The terminology CUDA core is obsolete. Now, they say that an SM has, perhaps, 32 single-precision float units, 32 integer units, 32 CUDA instruction dispatchers, 16 double-precision float units, etc. Each unit operates independently.
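
For item 4, here is a minimal cooperative-groups sketch as I understand the API (available since CUDA 9): each block is partitioned into 8-thread tiles that synchronize and shuffle among themselves.

    // coop_groups.cu -- sketch of cooperative groups: partition a block into tiles of 8.
    #include <cstdio>
    #include <cuda_runtime.h>
    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void tileSum(const int *in, int *out, int n) {
        cg::thread_block block = cg::this_thread_block();
        cg::thread_block_tile<8> tile = cg::tiled_partition<8>(block);

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int v = (i < n) ? in[i] : 0;

        // Reduce within the 8-thread tile using tile-scoped shuffles.
        for (int offset = tile.size() / 2; offset > 0; offset /= 2)
            v += tile.shfl_down(v, offset);

        if (tile.thread_rank() == 0)           // one result per tile
            out[i / 8] = v;
    }

    int main() {
        const int n = 64;
        int *in, *out;
        cudaMallocManaged(&in, n * sizeof(int));
        cudaMallocManaged(&out, (n / 8) * sizeof(int));
        for (int i = 0; i < n; ++i) in[i] = 1;
        tileSum<<<1, n>>>(in, out, n);
        cudaDeviceSynchronize();
        printf("out[0] = %d (expect 8)\n", out[0]);
        cudaFree(in); cudaFree(out);
        return 0;
    }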

5 NVIDIA Quantum Computing

  1. Nice intro: https://docs.nvidia.com/cuda/cuquantum/overview.html

  2. https://quantumzeitgeist.com/artificial-intelligence-giant-nvidia-creates-a-new-platform-for-hybrid-quantum-classical-computing-qoda/

  3. https://developer.nvidia.com/qoda

  4. https://nvidianews.nvidia.com/news/nvidia-announces-hybrid-quantum-classical-computing-platform

  5. NVIDIA Special Address at Q2B: Defining the Quantum Accelerated Supercomputing Platform 18:39. Jul 12, 2022

    Quantum computing has the potential to offer giant leaps in computational capabilities, impacting a range of industries from drug discovery to portfolio optimization. Realizing these benefits requires pushing the boundaries of quantum information science in the development of algorithms, research into more capable quantum processors, and the creation of tightly integrated quantum-classical systems and tools. We'll review these challenges facing quantum computing, showcase how GPU computing can help, and reveal exciting developments in tightly integrated quantum-classical computing. https://developer.nvidia.com/qoda

  6. Watch Nvidia Reveal Quantum Computing Platform, QODA 6:53 Jul 12, 2022

    At Q2B, Nvidia announces QODA, a new hybrid quantum-classical computing platform. See it explained here.

  7. https://developer.nvidia.com/blog/introducing-qoda-the-platform-for-hybrid-quantum-classical-computing/

  8. https://blogs.nvidia.com/blog/2022/07/29/what-is-a-qpu/

  9. Q2B 2021 | Accelerating Quantum Algorithm Research with cuQuantum | Harun Bayraktar 29:52. December 9, 2021. Excellent solid talk.