
PAR Homework 5, due Thu 2020-02-20, noon

Rules

  1. Submit the answers to Gradescope.
  2. You may do homeworks in teams of 2 students. Create a Gradescope team and make one submission with both names.
  3. For redundancy, at the top of your submitted homework write what it is and who you are. E.g., "Parallel Homework 2, 1/30/20, by Boris Badenov and Natasha Fatale".
  4. Each question is 10 points.

Questions

  1. Assume that a kernel is launched with 1000 thread blocks, each of which has 512 threads. If a variable is declared as a shared memory variable, how many versions of the variable will be created over the lifetime of the kernel's execution? (module 4)

  2. Assume that each atomic operation in a DRAM system has a total latency of 100ns. What is the maximal throughput we can get for atomic operations on the same global memory variable? (module 7)

  3. For a processor that supports atomic operations in L2 cache, assume that each atomic operation takes 4ns to complete in L2 cache and 100ns to complete in DRAM. Assume that 90% of the atomic operations hit in L2 cache. What is the approximate throughput for atomic operations on the same global memory variable?

  4. For the following basic reduction kernel code fragment, if the block size is 1024 and the warp size is 32, how many warps in a block will have divergence during the iteration where the stride is equal to 1? (module 9; a complete, commented version of this fragment is sketched after the questions):

    unsigned int t = threadIdx.x;
    unsigned int start = 2*blockIdx.x*blockDim.x;
    // Each thread loads two elements into shared memory.
    partialSum[t] = input[start + t];
    partialSum[blockDim.x + t] = input[start + blockDim.x + t];
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2)
    {
        __syncthreads();
        if (t % stride == 0) { partialSum[2*t] += partialSum[2*t + stride]; }
    }
    
  5. In the previous question, how many warps in a block will have divergence during the iteration where stride is equal to 16?

  6. For the work-inefficient scan kernel based on reduction trees, assuming that we have 1024 elements, which of the following gives the closest approximation of the number of add operations performed? (module 10)

    1. (1024-1)*2
    2. (512-1)*2
    3. 1024*1024
    4. 1024*10
  7. Which of the following statements is true? (module 14; see the related sketch of pinned-memory transfers after the questions)

    1. Data transfer between CUDA device and host is done by DMA hardware using virtual addresses.
    2. The OS always guarantees that any memory being used by DMA hardware is not swapped out.
    3. If pageable data is to be transferred by cudaMemcpy(), it needs to be first copied to a pinned memory buffer before being transferred.
    4. Pinned memory is allocated with the cudaMalloc() function.
  8. What is the CUDA API call that makes sure that all previous kernel executions and memory copies on a device have been completed? (A related sketch appears after the questions.)

    1. __syncthreads()
    2. cudaDeviceSynchronize()
    3. cudaStreamSynchronize()
    4. __barrier()

Total: 80 pts.
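
For reference on questions 4 and 5, the following is a minimal self-contained sketch of the same basic reduction pattern. The kernel name basicReduction, the output array, and the BLOCK_SIZE constant are illustrative assumptions, not part of the graded fragment.

    #define BLOCK_SIZE 1024

    // Basic (divergent) tree reduction: each block reduces 2*blockDim.x input elements.
    __global__ void basicReduction(const float *input, float *output)
    {
        __shared__ float partialSum[2 * BLOCK_SIZE];

        unsigned int t = threadIdx.x;
        unsigned int start = 2 * blockIdx.x * blockDim.x;

        // Each thread loads two elements into shared memory.
        partialSum[t] = input[start + t];
        partialSum[blockDim.x + t] = input[start + blockDim.x + t];

        // Only threads whose index is a multiple of the stride stay active in a
        // given iteration, which is what questions 4 and 5 ask about.
        for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2)
        {
            __syncthreads();
            if (t % stride == 0)
            {
                partialSum[2 * t] += partialSum[2 * t + stride];
            }
        }

        // Thread 0 writes the block's partial sum.
        if (t == 0)
        {
            output[blockIdx.x] = partialSum[0];
        }
    }

With a block size of 1024, a launch such as basicReduction<<<numBlocks, BLOCK_SIZE>>>(d_in, d_out) would have each block reduce 2048 input elements into one partial sum.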
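
Question 7 concerns how host memory interacts with DMA transfers. The snippet below is a hedged sketch of allocating pinned (page-locked) host memory and copying it to the device; the buffer size and variable names are illustrative assumptions.

    #include <cuda_runtime.h>

    int main(void)
    {
        const size_t n = 1 << 20;   // illustrative buffer size
        float *h_pinned = NULL;     // page-locked (pinned) host buffer
        float *d_data = NULL;       // device (global memory) buffer

        // cudaMallocHost allocates pinned host memory; cudaMalloc allocates device memory.
        cudaMallocHost((void **)&h_pinned, n * sizeof(float));
        cudaMalloc((void **)&d_data, n * sizeof(float));

        for (size_t i = 0; i < n; ++i)
            h_pinned[i] = 1.0f;

        // Transfers from pinned memory can be handled directly by the DMA engine;
        // pageable source buffers are first staged through a pinned buffer by the runtime.
        cudaMemcpy(d_data, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice);

        cudaFree(d_data);
        cudaFreeHost(h_pinned);
        return 0;
    }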
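
Question 8 is about host-side synchronization with the device. Below is a hedged sketch showing an asynchronous kernel launch followed by a device-wide synchronization call; the kernel dummyKernel and its launch configuration are illustrative assumptions.

    #include <cuda_runtime.h>
    #include <stdio.h>

    // Trivial kernel used only to have some asynchronous device work to wait on.
    __global__ void dummyKernel(int *out)
    {
        out[threadIdx.x] = threadIdx.x;
    }

    int main(void)
    {
        int *d_out = NULL;
        int h_out[32] = {0};

        cudaMalloc((void **)&d_out, 32 * sizeof(int));

        // Kernel launches return immediately on the host (asynchronous).
        dummyKernel<<<1, 32>>>(d_out);

        // cudaDeviceSynchronize() blocks the host until all previously issued
        // device work (kernels and memory copies) has completed.
        cudaDeviceSynchronize();

        cudaMemcpy(h_out, d_out, 32 * sizeof(int), cudaMemcpyDeviceToHost);
        printf("h_out[31] = %d\n", h_out[31]);

        cudaFree(d_out);
        return 0;
    }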