
PAR Class 18, Mon 2021-03-29

1 Optional day off

Would the class like a day off in the next week or two?

2 Thrust

  1. Stanford's parallel course notes.

    Starting at Lecture 8, slide 22.

    Lectures 9 and later will be left for you to browse if you're interested.

  2. IMO the syntax for the zip iterator could have been simpler. When I use it, I wrap it in a simpler interface.
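
    For example, a wrapper could look roughly like this (zip2 is a hypothetical name, not part of Thrust):

      #include <thrust/iterator/zip_iterator.h>
      #include <thrust/tuple.h>

      // Zip two iterators with less typing than the raw Thrust calls.
      // (The auto return type needs C++14, e.g., nvcc -std=c++14.)
      template <typename It1, typename It2>
      auto zip2(It1 i1, It2 i2) {
          return thrust::make_zip_iterator(thrust::make_tuple(i1, i2));
      }

      // Usage, iterating over the pairs (a[i], b[i]):
      //   auto first = zip2(a.begin(), b.begin());
      //   auto last  = zip2(a.end(),   b.end());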

  3. Nvidia has (at least) 3 official locations for Thrust; they don't all have the same version, and they don't all include docs and examples.

    1. As part of CUDA, /local/cuda/targets/x86_64-linux/include/thrust/

    2. As part of the Nvidia HPC SDK.

    3. In the github repository, https://github.com/NVIDIA/thrust

  4. The most comprehensive doc is online at https://docs.nvidia.com/cuda/thrust/index.html

    This is up-to-date, and precise. However, it is only a summary.

  5. There is also http://thrust.github.io/doc/index.html , but it is badly written and slightly obsolete.

  6. There are various easy-to-read tutorials online. However, they are mostly obsolescent. E.g., they don't use C++11 lambdas, which are a big help.
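
    For example, with a lambda, saxpy needs no separate functor class. A sketch; __device__ lambdas need nvcc's --extended-lambda flag (--expt-extended-lambda in older versions):

      #include <thrust/device_vector.h>
      #include <thrust/transform.h>

      // y <- a*x + y, with the operation written inline as a lambda.
      void saxpy(float a, thrust::device_vector<float>& x, thrust::device_vector<float>& y)
      {
          thrust::transform(x.begin(), x.end(), y.begin(), y.begin(),
                            [=] __device__ (float xi, float yi) { return a * xi + yi; });
      }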

  7. Also, none of the non-Nvidia docs mention the recent unified memory additions.

  8. Look at some Thrust programs and documentation in 2021/files/thrust/

  9. There are other alternatives like Kokkos.

  10. The alternatives are lower-level (= faster and harder to use) and newer (= possibly less debugged, fewer users).

  11. For a while it looked like Nvidia had stopped developing Thrust, but they've started up again. Good.

  12. On parallel, in 2021/files/thrust/, are many little demo programs from the Thrust distribution, with my additions.

  13. Thrust is fast because functions that look like they would need linear time, e.g., reduce and scan, really take only logarithmic time in parallel: O(log n) steps given enough processors.

  14. In functions like reduce and transform, you often see an argument like thrust::multiplies<float>(). The syntax is as follows:

    1. thrust::multiplies<float> is a class.

    2. It overloads operator().

    3. However, in the call to reduce, thrust::multiplies<float>() is calling the default constructor to construct a variable of class thrust::multiplies<float>, and passing it to reduce.

    4. reduce will treat its argument as a function and call it (with two arguments, since it is a binary operator), triggering operator().

    5. You may also create your own variable of that class, e.g., thrust::multiplies<float> foo. Then you just say foo in the argument list, not foo().

    6. The optimizing compiler will replace the operator() function call with the defining expression and then continue optimizing. So, there is no overhead, unlike if you passed in a pointer to a function.
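
    A minimal sketch putting points 3 and 5 together (the names and values are just for illustration):

      #include <thrust/device_vector.h>
      #include <thrust/reduce.h>
      #include <thrust/functional.h>
      #include <cstdio>

      int main()
      {
          thrust::device_vector<float> v(10, 2.0f);

          // Temporary functor constructed right in the call; reduce then invokes its operator().
          float p1 = thrust::reduce(v.begin(), v.end(), 1.0f, thrust::multiplies<float>());

          // Same computation with a named functor variable; note that we pass foo, not foo().
          thrust::multiplies<float> foo;
          float p2 = thrust::reduce(v.begin(), v.end(), 1.0f, foo);

          std::printf("%g %g\n", p1, p2);   // both are 1024, the product of ten 2s
          return 0;
      }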

  15. Sometimes, e.g., in saxpy.cu, you see saxpy_functor(A).

    1. The class saxpy_functor has a constructor taking one argument.

    2. saxpy_functor(A) constructs and returns a variable of class saxpy_functor and stores A in the variable.

    3. The class also overloads operator().

    4. (Let's call the new variable foo.) foo(x, y) calls foo's operator(); its execution uses the stored A.

    5. Effectively, we did a closure of saxpy_functor; that is, we bound a property and returned a new, more restricted, variable or class.
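
    A sketch of the pattern, along the lines of the saxpy.cu example:

      #include <thrust/device_vector.h>
      #include <thrust/transform.h>

      // The constructor stores A; operator() uses the stored value later.
      struct saxpy_functor
      {
          const float a;
          saxpy_functor(float _a) : a(_a) {}
          __host__ __device__
          float operator()(float x, float y) const { return a * x + y; }
      };

      // y <- A*x + y: saxpy_functor(A) builds the functor (the closure over A),
      // and transform calls its operator() on each pair of elements.
      void saxpy(float A, thrust::device_vector<float>& x, thrust::device_vector<float>& y)
      {
          thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_functor(A));
      }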

2.1 Bug

I found and reported a bug in version 100904: it does not work with OpenMP. The report was immediately closed because they already knew about it. They just hadn't told us users.

A while ago I reported a bug in nvcc, where it went into an infinite loop for a certain array size. The next minor version of CUDA was released the next day.

Two observations:

  1. I'm good at breaking SW by expecting it to meet its specs.

  2. Nvidia is responsive, which I like.

2.3 Alternate docs in parallel-class/2021/files/thrust/doc

  1. An_Introduction_To_Thrust.pdf

  2. GTC_2010_Part_2_Thrust_By_Example.pdf

    We'll look at this starting at slide 27. It shows parallel programming paradigms.