Homework 4, due Thurs 3-20-14
Implement matrix multiplication to multiply a 16384x16384 single-precision float matrix by itself. Implement it in each of the several ways listed below and compare the times. Do not count the time to create and fill the matrices. Measure the total time to move the data to/from the device (if necessary) and to do the multiplication.
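For the timing, a pattern like the following may help (a sketch only; the function and variable names are mine, and error checking is omitted). It brackets just the transfer-plus-multiply phase with CUDA events; std::chrono is enough for the host-only version.

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Sketch of timing one method. Allocation and filling happen before the
// timed region; only the (optional) transfers and the multiply are timed.
void timeOneMethod()
{
    // ... allocate and fill the matrices here (not counted) ...

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // ... copy to the device (if this method needs it), launch the kernel,
    //     copy the result back (if this method needs it) ...
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);            // wait for the timed work to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("transfer + multiply: %.1f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    // For the host-only version, std::chrono around the multiply is enough:
    //   auto t0 = std::chrono::steady_clock::now();
    //   ... host multiply ...
    //   auto t1 = std::chrono::steady_clock::now();
    //   double sec = std::chrono::duration<double>(t1 - t0).count();
}
```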
The point of this homework is to practice CUDA programming, not to produce the best possible matrix multiplication program. Therefore, write the code yourself in C++. Do not call libraries such as cuBLAS. However, if you wish, after writing your own code, you may also write a version calling libraries and report those results.
- On the host.
- On the device, accessing the host data with unified memory.
- On the device, copying the data back and forth.
- On the device, using unified virtual addressing. (In these first four methods, most of the code is the same; illustrative sketches of each approach appear after this list.)
- On the device, using texture memory.
- If you are really ambitious, try a recursive, asymptotically faster algorithm such as Strassen's, whose time is $T(N) = \Theta(N^{\lg 7})$, on the fastest memory method above. The question here is whether the asymptotic time improvement will be cancelled by the more complex algorithm. Implement the recursion by having kernels start other kernels (see the last sketch below).
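A sketch of the host version (names such as hostMultiply, A, C, and N are mine, not required): a plain triple loop over row-major flat arrays, with C assumed zero-initialized.

```cpp
#include <vector>

// Naive host multiplication C = A * A, row-major, N x N.
// The (i, k, j) loop order keeps the inner loop sequential in memory,
// which matters a lot at N = 16384. C must start out all zeros
// (a std::vector<float> of the right size already is).
void hostMultiply(const std::vector<float>& A, std::vector<float>& C, int N)
{
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < N; ++k) {
            float a = A[(size_t)i * N + k];
            for (int j = 0; j < N; ++j)
                C[(size_t)i * N + j] += a * A[(size_t)k * N + j];
        }
}
```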
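For the device versions, one simple kernel (one thread per output element) can be shared by the unified-memory, explicit-copy, and UVA variants. Below is a hedged sketch of that kernel together with a unified-memory driver using cudaMallocManaged (CUDA 6.0); error checking is omitted and all names are placeholders.

```cpp
#include <cuda_runtime.h>

// One thread per element of C = A * A (square, row-major, N x N).
__global__ void matMulKernel(const float* A, float* C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[(size_t)row * N + k] * A[(size_t)k * N + col];
        C[(size_t)row * N + col] = sum;
    }
}

// Unified-memory driver: the same pointers are valid on host and device,
// so there are no explicit cudaMemcpy calls.
void multiplyUnified(int N)
{
    size_t bytes = (size_t)N * N * sizeof(float);
    float *A, *C;
    cudaMallocManaged((void**)&A, bytes);
    cudaMallocManaged((void**)&C, bytes);
    // ... fill A on the host (not timed) ...

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    matMulKernel<<<grid, block>>>(A, C, N);
    cudaDeviceSynchronize();      // after this, C is visible to the host again

    cudaFree(A);
    cudaFree(C);
}
```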
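The copy-back-and-forth variant can reuse the same kernel and manage device memory explicitly; another sketch, with hA and hC assumed already allocated and filled on the host:

```cpp
// Explicit-copy driver: copy in, multiply, copy out.
void multiplyCopy(const float* hA, float* hC, int N)
{
    size_t bytes = (size_t)N * N * sizeof(float);
    float *dA, *dC;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dC, bytes);

    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);   // timed: host -> device

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    matMulKernel<<<grid, block>>>(dA, dC, N);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);   // timed: device -> host
                                                         // (also waits for the kernel)
    cudaFree(dA);
    cudaFree(dC);
}
```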
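One way to exercise unified virtual addressing (this is my reading of that bullet, not the only possible one) is zero-copy access: pin the host matrices with cudaHostAlloc and let the kernel read and write them directly over the bus. A sketch:

```cpp
// Zero-copy / UVA driver: the kernel dereferences pinned host memory directly.
void multiplyZeroCopy(int N)
{
    // If mapped pinned memory is not enabled by default on the system, call
    // cudaSetDeviceFlags(cudaDeviceMapHost) once at program start, before any
    // other CUDA call.
    size_t bytes = (size_t)N * N * sizeof(float);
    float *hA, *hC;
    cudaHostAlloc((void**)&hA, bytes, cudaHostAllocMapped);
    cudaHostAlloc((void**)&hC, bytes, cudaHostAllocMapped);
    // ... fill hA on the host (not timed) ...

    // With UVA on a 64-bit system the host pointer from cudaHostAlloc is also
    // valid in device code; cudaHostGetDevicePointer returns that address.
    float *dA, *dC;
    cudaHostGetDevicePointer((void**)&dA, hA, 0);
    cudaHostGetDevicePointer((void**)&dC, hC, 0);

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    matMulKernel<<<grid, block>>>(dA, dC, N);
    cudaDeviceSynchronize();     // results land directly in hC

    cudaFreeHost(hA);
    cudaFreeHost(hC);
}
```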
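For the texture-memory version on compute capability 3.x, one option is a texture object bound to pitched device memory (a 1D linear binding is too small for 16384x16384 floats on most devices). The kernel below is the same naive multiply, reading the input through tex2D; again a hedged sketch with placeholder names.

```cpp
// Kernel reading the input matrix through a 2D texture object.
__global__ void matMulTexKernel(cudaTextureObject_t texA, float* C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)                       // A[row][k] * A[k][col]
            sum += tex2D<float>(texA, k + 0.5f, row + 0.5f) *
                   tex2D<float>(texA, col + 0.5f, k + 0.5f);
        C[(size_t)row * N + col] = sum;
    }
}

void multiplyTexture(const float* hA, float* hC, int N)
{
    // Pitched device allocation keeps rows aligned for texturing.
    float* dA;
    size_t pitch;
    cudaMallocPitch((void**)&dA, &pitch, N * sizeof(float), N);
    cudaMemcpy2D(dA, pitch, hA, N * sizeof(float),
                 N * sizeof(float), N, cudaMemcpyHostToDevice);

    float* dC;
    cudaMalloc((void**)&dC, (size_t)N * N * sizeof(float));

    // Describe the pitched memory as a 2D texture of floats, unfiltered.
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypePitch2D;
    resDesc.res.pitch2D.devPtr = dA;
    resDesc.res.pitch2D.desc = cudaCreateChannelDesc<float>();
    resDesc.res.pitch2D.width = N;
    resDesc.res.pitch2D.height = N;
    resDesc.res.pitch2D.pitchInBytes = pitch;

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;
    texDesc.filterMode = cudaFilterModePoint;
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;

    cudaTextureObject_t texA = 0;
    cudaCreateTextureObject(&texA, &resDesc, &texDesc, NULL);

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    matMulTexKernel<<<grid, block>>>(texA, dC, N);

    cudaMemcpy(hC, dC, (size_t)N * N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(texA);
    cudaFree(dA);
    cudaFree(dC);
}
```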
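For the ambitious last item, the mechanism to practice is dynamic parallelism: a kernel launching other kernels (compute capability 3.5; compile with something like nvcc -arch=sm_35 -rdc=true -lcudadevrt). The sketch below is not Strassen: it does the ordinary eight-product divide-and-conquer recursion, but it shows the kernel-launches-kernel structure that Strassen's seven products would replace. The cutoff, the <<<1,1>>> parent launches, and all names are my assumptions.

```cpp
// Base case: one thread per element, accumulating C += A * B.
// ld is the leading dimension (row stride) of the full matrices.
__global__ void mulAddTile(const float* A, const float* B, float* C, int n, int ld)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[(size_t)row * ld + k] * B[(size_t)k * ld + col];
        C[(size_t)row * ld + col] += sum;
    }
}

// Recursive driver, launched <<<1,1>>> from the host (and from itself).
// All child launches from this single thread go into the same (NULL) stream,
// so they run one after another and the updates to each C quadrant do not race.
// C must be zeroed (e.g. cudaMemset) before the top-level launch.
__global__ void recursiveMul(const float* A, const float* B, float* C,
                             int n, int ld, int cutoff)
{
    if (n <= cutoff) {
        dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
        mulAddTile<<<grid, block>>>(A, B, C, n, ld);
        return;
    }
    int h = n / 2;
    // Quadrant origins inside the larger ld x ld row-major layout.
    const float *A11 = A,                  *A12 = A + h,
                *A21 = A + (size_t)h * ld, *A22 = A + (size_t)h * ld + h;
    const float *B11 = B,                  *B12 = B + h,
                *B21 = B + (size_t)h * ld, *B22 = B + (size_t)h * ld + h;
    float       *C11 = C,                  *C12 = C + h,
                *C21 = C + (size_t)h * ld, *C22 = C + (size_t)h * ld + h;

    // Ordinary divide and conquer: 8 half-size products. Strassen would form
    // 7 products from sums/differences of these quadrants instead.
    recursiveMul<<<1, 1>>>(A11, B11, C11, h, ld, cutoff);
    recursiveMul<<<1, 1>>>(A12, B21, C11, h, ld, cutoff);
    recursiveMul<<<1, 1>>>(A11, B12, C12, h, ld, cutoff);
    recursiveMul<<<1, 1>>>(A12, B22, C12, h, ld, cutoff);
    recursiveMul<<<1, 1>>>(A21, B11, C21, h, ld, cutoff);
    recursiveMul<<<1, 1>>>(A22, B21, C21, h, ld, cutoff);
    recursiveMul<<<1, 1>>>(A21, B12, C22, h, ld, cutoff);
    recursiveMul<<<1, 1>>>(A22, B22, C22, h, ld, cutoff);
}
```

From the host you would zero the device result matrix and then launch, for example, recursiveMul<<<1,1>>>(dA, dA, dC, N, N, 2048). The single-thread parent wastes some parallelism, but it keeps all child launches in one stream so their ordering (and hence correctness of the accumulation) is guaranteed; concurrent children via device-side streams would be a further refinement.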
Feel free to start with the code in the lecture slides and programs in the CUDA SDK.
This exercise requires device compute capability 3.5 and CUDA 6.0. (It's beneficial to use the latest versions since they are more capable.) If you haven't gotten an account on geoxeon and want one, send me your RCS user name and a public SSH key.