/parallel-class/cuda/matmul2.cu plays with matrix multiplication of two random 1000x1000 matrices, and with managed memory.
- It shows a really quick way to use OpenMP, which gives about a 10x speedup.
- It shows a really quick way to use CUDA, which gives about a 15x speedup. This just uses one thread block per row of the input matrix A, and one thread per output element. That's 1M threads. The data is read from managed memory. Note how easy it is. There is no storing tiles into fast shared memory.
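The kernel described above might look roughly like this (a sketch under the stated assumptions, not the actual matmul2.cu code): `blockIdx.x` picks the row, `threadIdx.x` picks the column, and all three matrices live in `cudaMallocManaged` memory so neither `cudaMemcpy` nor explicit device allocation is needed.

```cuda
// One thread block per row i of the output, one thread per element j.
// No shared-memory tiling: each thread just reads global (managed)
// memory directly.
__global__ void matmul(const float* A, const float* B, float* C, int n) {
    int i = blockIdx.x;    // row of C
    int j = threadIdx.x;   // column of C
    if (i < n && j < n) {
        float sum = 0;
        for (int k = 0; k < n; k++)
            sum += A[i*n + k] * B[k*n + j];
        C[i*n + j] = sum;
    }
}

// Host side: managed memory is visible to both CPU and GPU.
void run(int n) {                       // n = 1000 in matmul2.cu
    float *A, *B, *C;
    cudaMallocManaged(&A, n*n*sizeof(float));
    cudaMallocManaged(&B, n*n*sizeof(float));
    cudaMallocManaged(&C, n*n*sizeof(float));
    // ... fill A and B ...
    matmul<<<n, n>>>(A, B, C, n);       // n blocks of n threads = 1M threads
    cudaDeviceSynchronize();            // wait before reading C on the host
}
```

With n = 1000 the launch uses 1000 threads per block, which fits under the 1024-thread-per-block limit.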
- It multiplies the matrices on the host, and compares reading them from normal memory and from managed memory. The latter is 2.5x slower. Dunno why.
- matmul2 also shows some of my utility functions and macros.
- InitCUDA prints a description of the GPU etc.
- cout << PRINTC(expr) prints an expression's name and then its value and a comma.
- PRINTN is like PRINTC but ends with endl.
- TIME(expr) evals an expression, then prints its name and the total and delta of the CPU time, the elapsed time, and their ratio.
- CT evals and prints the elapsed time of a CUDA kernel.
- Later I may add new tests to this.
Unrelated to this course, but perhaps interesting:
All compute servers have for decades had a microprocessor frontend that controls the boot process. The current iteration is called an IPMI (Intelligent Platform Management Interface). It has a separate ethernet port and allows remote BIOS configuration and booting. On parallel, that port is not connected (I don't trust its security).
The start of Nvidia's Parallel Thread Execution (PTX) documentation has useful info.
This is one of a batch of CUDA docs. Browse as you wish.