PAR Class 7, Wed 2018-02-28

1   Term project progress

How's this going? Send me a progress report.

2   Parallel programs

/parallel-class/cuda/matmul2.cu plays with matrix multiplication of two random 1000x1000 matrices, and with managed memory.

  1. It shows a really quick way to use OpenMP, which gets a 10x speedup; a sketch follows this list.

  2. It shows a really quick way to use CUDA, which gets a 15x speedup. This just uses one thread block per row of the input matrix A, and one thread per output element. That's 1M threads. The data is read from managed memory. Note how easy it is; there is no storing of tiles into fast shared memory. A kernel sketch follows this list.

  3. It multiplies the matrices on the host, and compares reading them from normal memory and from managed memory. The latter is 2.5x slower. Dunno why.

  4. matmul2 also shows some of my utility functions and macros; guessed-at sketches of a few of them follow this list.

    1. InitCUDA prints a description of the GPU etc.
    2. cout << PRINTC(expr) prints an expression's name and then its value and a comma.
    3. PRINTN is like PRINTC but ends with endl.
    4. TIME(expr) evaluates an expression, then prints its name and the total and delta
      1. CPU time,
      2. elapsed time,
      3. their ratio.
    5. CT runs a CUDA kernel and prints its elapsed time.
    6. Later I may add new tests to this.
  5. /parallel-class/cuda/checksum.cc shows a significant-digits problem when you add many small numbers; a reproduction is sketched after this list.

  6. sum_reduction.cu is Stanford's program.

  7. sum_reduction2.cu is my modification to use managed memory.

    Note how both sum_reduction and sum_reduction2 give different answers for the serial and the parallel computation. That is bad.

  8. sum_reduction3.cu is a modification to try to find the problem. One problem is insufficient precision in the sum; using double works. However, there might be other problems. A generic reduction sketch follows this list.
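
Here is a hedged sketch of the OpenMP version from item 1; the names are mine, not necessarily matmul2.cu's:

    // Sketch of matmul2.cu's OpenMP matrix multiply (item 1).
    // Compile with: g++ -O3 -fopenmp
    const int N = 1000;

    // c = a*b; all three are N x N and row-major. The single pragma
    // parallelizes the outer loop, giving each thread a chunk of
    // output rows.
    void matmul_omp(const float *a, const float *b, float *c) {
    #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                float sum = 0;
                for (int k = 0; k < N; k++)
                    sum += a[i*N + k] * b[k*N + j];
                c[i*N + j] = sum;
            }
    }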
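
The CUDA version from item 2 maps one block to each row and one thread to each output element. A hedged sketch, again with my own names, including the managed-memory allocation from item 3:

    // Sketch of matmul2.cu's CUDA matrix multiply (item 2).
    #include <cuda_runtime.h>

    const int N = 1000;

    // One thread block per row, one thread per output element, so a
    // 1000x1000 product launches 1M threads. No shared-memory tiling.
    __global__ void matmul(const float *a, const float *b, float *c) {
        int i = blockIdx.x;            // output row
        int j = threadIdx.x;           // output column
        float sum = 0;
        for (int k = 0; k < N; k++)
            sum += a[i*N + k] * b[k*N + j];
        c[i*N + j] = sum;
    }

    int main() {
        float *a, *b, *c;
        // Managed memory: one allocation visible to host and device.
        cudaMallocManaged(&a, N*N*sizeof(float));
        cudaMallocManaged(&b, N*N*sizeof(float));
        cudaMallocManaged(&c, N*N*sizeof(float));
        // ... fill a and b with random values on the host ...
        matmul<<<N, N>>>(a, b, c);     // N blocks of N threads
        cudaDeviceSynchronize();       // wait before the host reads c
        cudaFree(a); cudaFree(b); cudaFree(c);
    }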
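
The utilities in item 4 might look roughly like the following; these are hypothetical reconstructions, not the actual definitions in matmul2.cu. CT presumably wraps similar timing, via cudaEvent, around a kernel launch.

    // Hypothetical reconstructions of InitCUDA, PRINTC, PRINTN, TIME.
    #include <chrono>
    #include <ctime>
    #include <iostream>
    #include <cuda_runtime.h>

    // Query and print a description of the GPU (item 4.1).
    void InitCUDA() {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);        // device 0
        std::cout << p.name << ", " << p.multiProcessorCount
                  << " SMs, " << p.totalGlobalMem / (1 << 20)
                  << " MiB global memory" << std::endl;
    }

    // Stream the expression's source text, its value, and ", ".
    #define PRINTC(e) #e << "=" << (e) << ", "
    // Like PRINTC, but end the line instead.
    #define PRINTN(e) #e << "=" << (e) << std::endl

    // Evaluate expr, then print its name with the CPU time, the
    // elapsed (wall-clock) time, and their ratio. The real TIME also
    // reports running totals and deltas.
    #define TIME(expr) { \
        std::clock_t c0 = std::clock(); \
        auto w0 = std::chrono::steady_clock::now(); \
        expr; \
        double cpu = double(std::clock() - c0) / CLOCKS_PER_SEC; \
        double wall = std::chrono::duration<double>( \
            std::chrono::steady_clock::now() - w0).count(); \
        std::cout << #expr << ": cpu " << cpu << "s, elapsed " << wall \
                  << "s, ratio " << cpu / wall << std::endl; \
    }

    // Usage: int n = 6; std::cout << PRINTC(n) << PRINTN(n*7);
    // prints: n=6, n*7=42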
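
The precision problem in item 5 is easy to reproduce: a float carries about 7 significant digits, so once the running sum is much larger than each addend, the additions mostly round away. A sketch in the spirit of checksum.cc:

    // Adding many small numbers loses significant digits in float.
    #include <iostream>

    int main() {
        float fsum = 0;
        double dsum = 0;
        // Add ten million copies of 0.1. The exact answer is 1e6.
        for (int i = 0; i < 10000000; i++) {
            fsum += 0.1f;   // float: the sum outgrows the addend
            dsum += 0.1;    // double: enough digits to stay accurate
        }
        std::cout << "float:  " << fsum << "\n"    // noticeably off
                  << "double: " << dsum << "\n";   // ~1e6
    }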
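
For items 6-8, the usual kernel is a shared-memory tree reduction, sketched generically below (my sketch, not Stanford's exact code). The parallel tree adds the same numbers in a different order than the serial loop, so float round-off differs between the two; a double accumulator shrinks the discrepancy, which is why item 8's fix works.

    // Generic shared-memory tree reduction, in the spirit of
    // sum_reduction.cu. Assumes blockDim.x is a power of two.
    __global__ void block_sum(const float *in, float *out, int n) {
        extern __shared__ float s[];     // blockDim.x floats
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        s[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();
        // Halve the number of active threads each pass.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                s[tid] += s[tid + stride];
            __syncthreads();
        }
        if (tid == 0)
            out[blockIdx.x] = s[0];      // one partial sum per block
    }
    // Launch with the shared-memory size as the third parameter:
    // block_sum<<<nblocks, nthreads, nthreads*sizeof(float)>>>(in, out, n);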

3   Computer factoid

Unrelated to this course, but perhaps interesting:

All compute servers have for decades had a microprocessor frontend that controls the boot process. The current iteration is a baseboard management controller speaking IPMI, the Intelligent Platform Management Interface. It has a separate ethernet port, and would allow remote BIOS configuration and booting. On parallel, that port is not connected (I don't trust the security).

4   CUDA Doc

The start of Nvidia's Parallel Thread Execution (PTX) ISA documentation has useful info.

This is one of a batch of CUDA docs. Browse as you wish.

5   Stanford lectures

All the lectures and associated files are on geoxeon and parallel.ecse at /parallel-class/stanford/ .

They are also online here.

Lecture 6, parallel patterns 1, presents some paradigms of parallel programming. These are generally useful building blocks for parallel algorithms.

6   Thrust

Stanford lecture 8: Thrust. Thrust is a functional-style C++ template frontend to various backends, such as CUDA and OpenMP; an example is sketched below.

Programs are in /parallel-class/thrust/ .
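
A minimal example of the Thrust style (my own sketch, not one of the programs in that directory):

    // Compile with: nvcc thrust_sketch.cu
    #include <thrust/device_vector.h>
    #include <thrust/functional.h>
    #include <thrust/reduce.h>
    #include <thrust/sequence.h>
    #include <thrust/transform.h>
    #include <iostream>

    int main() {
        thrust::device_vector<int> v(1000);
        thrust::sequence(v.begin(), v.end());   // v = 0, 1, ..., 999
        // Square each element on the GPU, functional style.
        thrust::transform(v.begin(), v.end(), v.begin(),
                          thrust::square<int>());
        // Parallel sum on the GPU.
        int sum = thrust::reduce(v.begin(), v.end());
        std::cout << sum << std::endl;          // 332833500
    }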