PAR Class 7, Wed 2018-02-28
1 Term project progress
How's this going? Send me a progress report.
2 Parallel programs
- /parallel-class/cuda/matmul2.cu plays with matrix multiplication of two random 1000x1000 matrices, and with managed memory.
  - It shows a really quick way to use OpenMP, which gives a 10x speedup; see the first sketch after this list.
  - It shows a really quick way to use CUDA, which gives a 15x speedup; see the second sketch after this list. This just uses one thread block per row of the input matrix A, and one thread per output element. That's 1M threads. The data is read from managed memory. Note how easy it is: there is no storing of tiles into fast shared memory.
  - It also multiplies the matrices on the host, comparing reading them from normal memory versus managed memory. The latter is 2.5x slower; I don't know why.
  - matmul2 also shows some of my utility functions and macros (the third sketch after this list reconstructs the print macros):
    - InitCUDA prints a description of the GPU, etc.
    - cout << PRINTC(expr) prints an expression's name, then its value and a comma.
    - PRINTN is like PRINTC but ends with endl.
    - TIME(expr) evals an expression, then prints its name and the total and delta CPU time, elapsed time, and their ratio.
    - CT evals and prints the elapsed time of a CUDA kernel.
    - Later I may add new tests to this.
- /parallel-class/cuda/checksum.cc shows a significant-digits problem when you add many small floating-point numbers; the fourth sketch after this list demonstrates the effect.
- sum_reduction.cu is Stanford's program.
- sum_reduction2.cu is my modification to use managed memory. Note how both sum_reduction and sum_reduction2 give different answers for the serial and the parallel computations. That is bad.
- sum_reduction3.cu is a modification to try to find the problem. One problem is insufficient precision in the sum; using double works (the fifth sketch after this list shows the idea). However, there might be other problems.
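First, a minimal sketch of the OpenMP approach (my own illustration, not the actual matmul2.cu): each output row is independent, so a single pragma parallelizes the whole multiplication across all cores.

```cpp
// Sketch: NxN matrix multiply parallelized with one OpenMP pragma.
// Compile with: g++ -O3 -fopenmp matmul_omp.cc   (hypothetical file name)
#include <cstdio>
#include <cstdlib>
#include <vector>
using namespace std;

const int N = 1000;

int main() {
    vector<float> a(N*N), b(N*N), c(N*N);
    for (int i = 0; i < N*N; i++) { a[i] = drand48(); b[i] = drand48(); }
#pragma omp parallel for
    for (int i = 0; i < N; i++)          // each output row is independent
        for (int j = 0; j < N; j++) {
            float sum = 0;
            for (int k = 0; k < N; k++)
                sum += a[i*N+k] * b[k*N+j];
            c[i*N+j] = sum;
        }
    printf("c[0]=%f\n", c[0]);
}
```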
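Second, a sketch of the CUDA version described above (again my own illustration, not the course file): one thread block per row of A, one thread per output element, everything in managed memory, no shared-memory tiling.

```cpp
// Sketch: naive CUDA matrix multiply over managed memory.
// Compile with: nvcc -O3 matmul_cuda.cu   (hypothetical file name)
#include <cstdio>
#include <cstdlib>

const int N = 1000;

__global__ void matmul(const float* a, const float* b, float* c) {
    int i = blockIdx.x;        // one block per row of the output
    int j = threadIdx.x;       // one thread per element of that row
    float sum = 0;
    for (int k = 0; k < N; k++)
        sum += a[i*N+k] * b[k*N+j];
    c[i*N+j] = sum;            // no shared-memory tiling at all
}

int main() {
    float *a, *b, *c;
    // Managed memory is visible to both host and device; no cudaMemcpy.
    cudaMallocManaged(&a, N*N*sizeof(float));
    cudaMallocManaged(&b, N*N*sizeof(float));
    cudaMallocManaged(&c, N*N*sizeof(float));
    for (int i = 0; i < N*N; i++) { a[i] = drand48(); b[i] = drand48(); }
    matmul<<<N, N>>>(a, b, c); // 1000 blocks of 1000 threads = 1M threads
    cudaDeviceSynchronize();   // wait before the host reads c
    printf("c[0]=%f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
}
```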
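Third, a plausible reconstruction of the PRINTC and PRINTN macros; the real definitions are in matmul2.cu and may differ (TIME and CT are omitted here). The trick is the preprocessor's stringizing operator, #expr.

```cpp
// Sketch: print an expression's text and its value (my reconstruction,
// not the actual matmul2.cu definitions).
#include <iostream>
using namespace std;

#define PRINTC(expr) #expr << "=" << (expr) << ", "
#define PRINTN(expr) #expr << "=" << (expr) << endl

int main() {
    int n = 42;
    cout << PRINTC(n*2) << PRINTN(n+1);   // prints: n*2=84, n+1=43
}
```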
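Fourth, a sketch of the significant-digits effect that checksum.cc demonstrates (my own example, not the course file): once a float running sum reaches 2^24, adding 1.0f no longer changes it, because the addend is rounded away against the sum's 24-bit mantissa.

```cpp
// Sketch: adding 100M ones to a float sum silently stalls at 2^24.
#include <cstdio>

int main() {
    const int n = 100000000;
    float fsum = 0;
    double dsum = 0;
    for (int i = 0; i < n; i++) {
        fsum += 1.0f;   // rounds to a no-op once fsum == 2^24 = 16777216
        dsum += 1.0;    // double has a 53-bit mantissa; no problem here
    }
    printf("float sum:  %.0f (should be %d)\n", fsum, n);
    printf("double sum: %.0f\n", dsum);
}
```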
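Fifth, a sketch in the spirit of sum_reduction2.cu and sum_reduction3.cu (not the actual Stanford code): a block-level tree reduction over managed memory, with the per-block partial sums accumulated in double, which is the precision fix mentioned above.

```cpp
// Sketch: CUDA tree reduction with managed memory and a double accumulator.
#include <cstdio>

const int N = 1 << 20;
const int BLOCK = 256;   // power of 2, so the halving loop below works

__global__ void block_sum(const float* in, float* partial, int n) {
    __shared__ float s[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();
    // Tree reduction within the block: halve the active threads each pass.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = s[0];  // one sum per block
}

int main() {
    const int nblocks = (N + BLOCK - 1) / BLOCK;
    float *in, *partial;
    cudaMallocManaged(&in, N * sizeof(float));
    cudaMallocManaged(&partial, nblocks * sizeof(float));
    for (int i = 0; i < N; i++) in[i] = 0.1f;
    block_sum<<<nblocks, BLOCK>>>(in, partial, N);
    cudaDeviceSynchronize();
    double sum = 0;   // accumulating in double avoids the precision problem
    for (int b = 0; b < nblocks; b++) sum += partial[b];
    printf("parallel sum = %f, expected about %f\n", sum, 0.1 * N);
}
```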
3 Computer factoid
Unrelated to this course, but perhaps interesting:
All compute servers have for decades had a microprocessor frontend that controls the boot process. The current iteration is IPMI (the Intelligent Platform Management Interface), implemented by a baseboard management controller. It has a separate ethernet port and allows remote BIOS configuration and booting. On parallel, that port is not connected (I don't trust its security).
4 CUDA Doc
The start of Nvidia's Parallel Thread Execution (PTX) ISA documentation has useful info.
This is one of a batch of CUDA docs. Browse as you wish.
5 Stanford lectures
All the lectures and associated files are on geoxeon and parallel.ecse at /parallel-class/stanford/ .
They are also online here.
Lecture 6, Parallel patterns 1, presents some paradigms of parallel programming. These are generally useful building blocks for parallel algorithms.
6 Thrust
Stanford lecture 8 covers Thrust, a functional-style C++ frontend to various backends, such as CUDA, OpenMP, and Intel TBB.
Programs are in /parallel-class/thrust/ .
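For flavor, here is a minimal Thrust sketch (my own illustration, not one of the course programs): the same reduction as in the earlier sketches, written functionally. Compiled with nvcc, it runs on the GPU; the algorithm call itself never mentions threads or blocks.

```cpp
// Sketch: functional-style reduction with Thrust.
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main() {
    thrust::device_vector<float> v(1 << 20, 0.1f);  // filled on the device
    float sum = thrust::reduce(v.begin(), v.end(), 0.0f);
    printf("sum = %f\n", sum);
}
```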
7 IBM's quantum computer
As mentioned by Narayanaswami Chandrasekhar last week.
- https://en.wikipedia.org/wiki/IBM_Quantum_Experience
- https://techcrunch.com/2017/11/10/ibm-passes-major-milestone-with-20-and-50-qubit-quantum-computers-as-a-service/
- https://www.technologyreview.com/s/609451/ibm-raises-the-bar-with-a-50-qubit-quantum-computer/
- https://www.research.ibm.com/ibm-q/ - points to lots of info, e.g., QISKit on github, beginners guide, etc.
Your to-do: learn this and present it to me in class.