PAR Class 19, Thu 2020-04-02
1 Project proposal
To the several people who haven't uploaded their proposal to Gradescope: please do.
2 Homework 7 - status report
It is online.
3 Thrust
- Thrust is an API that looks like the STL. Its backend can be CUDA, OpenMP, or sequential host-based code.
- The online Thrust directory structure is a mess. Three main sites appear to be these:
  - The best way to install it is to clone from here. The latest version of the examples is also here, and the wiki has a lot of documentation.
  - This points to the above site.
  - https://developer.nvidia.com/thrust: this has links to other Nvidia docs, some of which are obsolete.
  - http://docs.nvidia.com/cuda/thrust/index.html: easy-to-read and thorough, but obsolete, documentation.
  - https://code.google.com/: no longer exists.
- Functional-programming philosophy.
- Many possible backends: host, GPU, OpenMP, TBB...
- Easier programming, once you get used to it.
- Code is efficient.
- Uses some unusual C++ techniques, like overloading operator().
- Since the Stanford slides were created, Thrust has adopted unified addressing, so that pointers know whether they are on the host or the device.
- On parallel, in /parallel-class/thrust/, there are many little demo programs from the Thrust distribution, with my additions.
- CUDACast videos on Thrust:
- Thrust is fast because the functions that look like they would need linear time really take only log time in parallel.
- In functions like reduce and transform, you often see an argument like thrust::multiplies<float>(). The syntax is as follows:
- thrust::multiplies<float> is a class.
- It overloads operator().
- However, in the call to reduce, thrust::multiplies<float>() is calling the default constructor to construct a variable of class thrust::multiplies<float>, and passing it to reduce.
- reduce will treat its argument as a callable and invoke it with two arguments at a time, triggering operator().
- You may also create your own variable of that class, e.g., thrust::multiplies<float> foo. Then you just say foo in the argument list, not foo().
- The optimizing compiler will replace the operator() function call with the defining expression and then continue optimizing. So, there is no overhead, unlike if you passed in a pointer to a function.
- Sometimes, e.g., in saxpy.cu, you see saxpy_functor(A).
- The class saxpy_functor has a constructor taking one argument.
- saxpy_functor(A) constructs a variable of class saxpy_functor and stores A in it.
- The class also overloads operator().
- (Let's call the new variable foo.) Calling foo invokes its operator(), whose execution uses the stored A.
- Effectively, we made a closure of saxpy_functor; that is, we bound a property and returned a new, more restricted, variable.
3.1 Bug
I found and reported a bug in version 100904: it does not work with OpenMP. The report was immediately closed because they already knew about the bug; they just hadn't told us users.
A while ago I reported a bug in nvcc, where it went into an infinite loop for a certain array size. The next minor version of CUDA was released the next day.
Two observations:
- I'm good at breaking software by expecting it to meet its specs.
- Nvidia is responsive, which I like.
3.2 Stanford's lectures
- Continue Stanford's parallel course notes: Lecture 8, Thrust ctd., starting at slide 24. There's a lot of stuff here.
3.3 Alternate docs in /parallel-class/thrust/doc/
- An_Introduction_To_Thrust.pdf
- GTC_2010_Part_2_Thrust_By_Example.pdf