# PAR Homework 3

Hand in your solution on RPILMS, unless instructions say otherwise. Each team should submit its solution under only one student's name. The other student's submission should just name the lead student. (This makes it easier for us to avoid grading it twice.)

If you have problems, then ask for help. The goal is to learn the material.

## Homework 3, due Fri 2017-02-24, 9am.

### Paper questions

1. Research and then describe the main changes from the NVIDIA Maxwell architecture to Pascal.
2. Although a thread can use up to 255 registers, doing so might be bad for performance. Why?
3. Give a common way that the various threads in a block can share data with each other.
4. Reading a word from global memory might take 400 cycles. Does that mean that a thread that reads many words from global memory will always take hundreds of times longer to complete?
5. Since the threads in a warp are executed in a SIMD fashion, how can an if-then-else block be executed?
6. What is unified virtual addressing and how does it make CUDA programming easier?

### Programming questions

1. Repeat homework 2's matrix multiplication problem, this time in CUDA. Report how much parallel speedup you get.

2. Look at the dataset /parallel-class/data/bunny. It contains 35947 points for the Stanford bunny.

Assuming that each point has a mass of 1 and is gravitationally attracted to all the others, compute the potential energy of the system. The formula is this:

$U = - \sum_{i=1}^{N-1} \sum_{j=i+1}^N \frac{1}{r_{ij}}$

where $r_{ij}$ is the distance between points $i$ and $j$. (This assumes that $G=1$.)

3. Now look at the dataset /parallel-class/data/blade, which contains 882954 points for a turbine blade. Can you process it?
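
As a sanity check for programming questions 2 and 3, the potential energy formula above can be sketched serially in a few lines. This is only a minimal CPU reference in Python for comparing against a CUDA result on small inputs, not a suggested solution. The names `potential_energy`, `pair_count`, and `Point` are hypothetical, and reading the dataset files is left out because their format is not stated here.

```python
import math
from typing import Sequence, Tuple

# Hypothetical type alias: one 3D point as (x, y, z).
Point = Tuple[float, float, float]


def potential_energy(points: Sequence[Point]) -> float:
    """Serial O(N^2) evaluation of U = -sum_{i<j} 1 / r_ij.

    Assumes unit masses and G = 1, as in the formula above.
    Coincident points (r_ij = 0) raise ZeroDivisionError.
    """
    total = 0.0
    n = len(points)
    for i in range(n - 1):
        for j in range(i + 1, n):
            # Euclidean distance between points i and j (Python 3.8+).
            r = math.dist(points[i], points[j])
            total += 1.0 / r
    return -total


def pair_count(n: int) -> int:
    """Number of distinct pairs (i, j) with i < j among n points."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    # Tiny made-up example: three points, not taken from either dataset.
    demo = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    print(potential_energy(demo))
    print(pair_count(35947))   # pairs for the bunny point count
    print(pair_count(882954))  # pairs for the blade point count
```

The pair count grows quadratically: the bunny has 646075431 pairs, while the blade has 389803441581, which no longer fits in a signed 32-bit integer. Any flattened pair index or pair counter in a CUDA kernel for the blade would therefore need a 64-bit type.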