PAR Lecture 11, Thurs Feb 23
Table of contents
1 Parallel.ecse programs
/parallel-class/cuda/matmul2.cu plays with matrix multiplication of two random 1000x1000 matrices, and with managed memory.
- It shows a really quick way to use OpenMP, which gives about a 10x speedup over the serial host version.
- It shows a really quick way to use CUDA, which gives about a 15x speedup. This just uses one thread block per row of the input matrix A, and one thread per output element. That's 1M threads. The data is read from managed memory. Note how easy it is: there is no staging of tiles into fast shared memory.
- It also multiplies the matrices on the host, and compares reading them from normal memory and from managed memory. The latter is 2.5x slower; I don't know why.
- matmul2 also shows some of my utility functions and macros.
- InitCUDA prints a description of the GPU etc.
- cout << PRINTC(expr) prints an expression's name, then its value, then a comma.
- PRINTN is like PRINTC but ends with endl.
- TIME(expr) evaluates an expression, then prints its name and the total and delta of:
- CPU time,
- elapsed time,
- their ratio.
- CT evaluates and prints the elapsed time of a CUDA kernel.
- Later I may add new tests to this.
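The kernel structure described above (one thread block per row of A, one thread per output element, operands in managed memory) can be sketched roughly as follows. This is my reconstruction, not the actual code of matmul2.cu; the names matmulKernel, N, etc. are placeholders:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

const int N = 1000;   // matrices are N x N

// One thread block per row of A, one thread per output element.
// a, b, c are in managed memory, addressable from both host and device.
__global__ void matmulKernel(const float *a, const float *b, float *c) {
    int row = blockIdx.x;     // which row of the output
    int col = threadIdx.x;    // which column of the output
    if (row < N && col < N) {
        float sum = 0;
        for (int k = 0; k < N; k++)
            sum += a[row * N + k] * b[k * N + col];
        c[row * N + col] = sum;
    }
}

int main() {
    float *a, *b, *c;
    cudaMallocManaged(&a, N * N * sizeof(float));
    cudaMallocManaged(&b, N * N * sizeof(float));
    cudaMallocManaged(&c, N * N * sizeof(float));
    for (int i = 0; i < N * N; i++) { a[i] = 1; b[i] = 1; }

    matmulKernel<<<N, N>>>(a, b, c);   // N blocks of N threads = 1M threads
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);       // each entry should equal N
    cudaFree(a); cudaFree(b); cudaFree(c);
}
```

Note that there is no explicit cudaMemcpy anywhere: managed memory pages migrate on demand, which is what makes this version so short.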
2 Computer factoid
Unrelated to this course, but perhaps interesting:
All compute servers have for decades had a microprocessor frontend that controls the boot process. The current iteration is called an IPMI. It has a separate ethernet port, and would allow remote BIOS configuration and booting. On parallel, that port is not connected (I don't trust the security).
3 CUDA Doc
The start of Nvidia's Parallel Thread Execution (PTX) ISA documentation has useful info.
This is one of a batch of CUDA docs. Browse as you wish.
4 Stanford lectures
All the lectures and associated files are on parallel.ecse at /parallel-class/stanford/ .
They are also online here.
Lecture 6 parallel patterns 1 presents some paradigms of parallel programming. These are generally useful building blocks for parallel algorithms.