PAR Class 19, Thu 2020-04-02
1 Project proposal
To the several people who haven't uploaded their proposal to Gradescope: please do.
2 Homework 7 - status report
It is online.
3 Thrust
- Thrust is an API that looks like the STL. Its backend can be CUDA, OpenMP, or sequential host-based code.
- The online Thrust directory structure is a mess. Three main sites appear to be these:
  - The best way to install it is to clone from here. The latest version of the examples is also here, and the wiki has a lot of documentation.
  - This points to the above site.
  - https://developer.nvidia.com/thrust: this has links to other Nvidia docs, some of which are obsolete.
  - http://docs.nvidia.com/cuda/thrust/index.html: easy-to-read and thorough, but obsolete, documentation.
  - https://code.google.com/: no longer exists.
- Functional-programming philosophy.
- Many possible backends: host, GPU, OpenMP, TBB...
- Easier programming, once you get used to it.
- Code is efficient.
- Uses some unusual C++ techniques, like overloading operator().
- Since the Stanford slides were created, Thrust has adopted unified addressing, so that pointers know whether they are on the host or the device.
- On parallel, in /parallel-class/thrust/, there are many little demo programs from the Thrust distribution, with my additions.
- CUDACast videos on Thrust:
- Thrust is fast because the functions that look like they would need linear time really take only log time in parallel.
- In functions like reduce and transform, you often see an argument like thrust::multiplies<float>(). The syntax is as follows:
- thrust::multiplies<float> is a class.
- It overloads operator().
- However, in the call to reduce, thrust::multiplies<float>() is calling the default constructor to construct a variable of class thrust::multiplies<float>, and passing it to reduce.
- reduce will treat its argument as a callable and invoke it with two arguments at a time, triggering operator().
- You may also create your own variable of that class, e.g., thrust::multiplies<float> foo. Then you just say foo in the argument list, not foo().
- The optimizing compiler will replace the operator() function call with the defining expression and then continue optimizing. So, there is no overhead, unlike if you passed in a pointer to a function.
- Sometimes, e.g., in saxpy.cu, you see saxpy_functor(A).
- The class saxpy_functor has a constructor taking one argument.
- saxpy_functor(A) constructs a variable of class saxpy_functor and stores A in it.
- The class also overloads operator().
- (Let's call the new variable foo.) Calling foo invokes its operator(), whose execution uses the stored A.
- Effectively, we made a closure of saxpy_functor; that is, we bound a property and returned a new, more restricted, variable.
3.1 Bug
I found and reported a bug in version 100904: it does not work with OpenMP. The report was immediately closed because they already knew about the bug; they just hadn't told us users.
A while ago I reported a bug in nvcc, where it went into an infinite loop for a certain array size. The next minor version of CUDA was released the next day.
Two observations:
- I'm good at breaking software by expecting it to meet its specs.
- Nvidia is responsive, which I like.
3.2 Stanford's lectures
- Continue Stanford's parallel course notes: Lecture 8, Thrust ctd., starting at slide 24. There's a lot of stuff here.
3.3 Alternate docs in /parallel-class/thrust/doc/
- An_Introduction_To_Thrust.pdf
- GTC_2010_Part_2_Thrust_By_Example.pdf