PAR Class 16, Mon 2022-03-14

1 Thrust - 2

  1. I'll lecture from .../files/nvidia/thrust/doc/GTC_2010_Part_2_Thrust_By_Example.pdf on parallel.ecse

    Why something from 2010? It's well presented, hasn't been updated by NVidia, and is still current, apart from not using lambdas.

    The deeper meaning of this stuff is the set of parallel programming and algorithm paradigms.

  2. On parallel in .../files/nvidia/thrust are many little demo programs from the thrust distribution, with my additions.

  3. Thrust is fast because functions that look as if they would need linear time really take only logarithmic time in parallel: the total work is still linear, but it is organized as a tree of logarithmic depth, so with enough parallel workers the elapsed time is logarithmic.
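    As a sketch of why this works (plain C++ rather than Thrust, with a hypothetical helper named tree_reduce), a reduction can be organized as a balanced tree. Each level halves the number of values, so there are only about log2(n) levels, and every combination within a level is independent and could run in parallel:

    ```cpp
    #include <cstddef>
    #include <iostream>
    #include <utility>
    #include <vector>

    // Hypothetical illustration: pairwise (tree) reduction with +.
    // Total work is still O(n), but the tree has only O(log n) levels;
    // with one worker per pair, each level runs in parallel,
    // giving O(log n) parallel time.
    float tree_reduce(const std::vector<float>& v) {
        std::vector<float> level = v;
        while (level.size() > 1) {
            std::vector<float> next((level.size() + 1) / 2);
            for (std::size_t i = 0; i < next.size(); ++i) {
                // Each iteration is independent of the others, so a
                // parallel backend could run them all simultaneously.
                next[i] = (2 * i + 1 < level.size())
                              ? level[2 * i] + level[2 * i + 1]
                              : level[2 * i];
            }
            level = std::move(next);
        }
        return level.empty() ? 0.0f : level[0];
    }

    int main() {
        std::vector<float> v{1, 2, 3, 4, 5, 6, 7, 8};
        std::cout << tree_reduce(v) << "\n";  // 36, after 3 levels instead of 7 serial adds
        return 0;
    }
    ```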

  4. In functions like reduce and transform, you often see an argument like thrust::multiplies<float>(). The syntax is as follows:

    1. thrust::multiplies<float> is a class.

    2. It overloads operator().

    3. However, in the call to reduce, thrust::multiplies<float>() is calling the default constructor to construct a variable of class thrust::multiplies<float>, and passing it to reduce.

    4. reduce treats that argument as something callable and calls it with two arguments, triggering operator().

    5. You may also create your own variable of that class, e.g., thrust::multiplies<float> foo. Then you just say foo in the argument list, not foo().

    6. The optimizing compiler will replace the operator() function call with the defining expression and then continue optimizing. So, there is no overhead, unlike if you passed in a pointer to a function.
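    The same pattern exists in standard C++, which makes it easy to try without a GPU: std::multiplies and std::accumulate play the roles of thrust::multiplies and thrust::reduce in this sketch. The () in std::multiplies<float>() is a default-constructor call, not a function call:

    ```cpp
    #include <functional>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<float> v{1.0f, 2.0f, 3.0f, 4.0f};

        // std::multiplies<float>() follows the same pattern as
        // thrust::multiplies<float>(): the trailing () invokes the default
        // constructor, producing an object whose overloaded operator()
        // multiplies its two arguments.
        float prod = std::accumulate(v.begin(), v.end(), 1.0f,
                                     std::multiplies<float>());

        // Equivalently, create a named functor object first; then pass
        // foo, not foo().
        std::multiplies<float> foo;
        float prod2 = std::accumulate(v.begin(), v.end(), 1.0f, foo);

        std::cout << prod << " " << prod2 << "\n";  // 24 24
        return 0;
    }
    ```

    Because the functor's type is known at compile time, the compiler can inline operator() into the reduction loop, which is the zero-overhead point made above.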

  5. Sometimes, e.g., in saxpy.cu, you see saxpy_functor(A).

    1. The class saxpy_functor has a constructor taking one argument.

    2. saxpy_functor(A) constructs a new object of class saxpy_functor, storing A inside it.

    3. The class also overloads operator().

    4. (Let's call the new object foo.) Calling foo with arguments invokes operator() on foo; its execution uses the stored A.

    5. Effectively, we made a closure of saxpy_functor; that is, we bound a property and returned a new, more restricted, variable or class.
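    Here is the pattern in plain C++ rather than CUDA, using std::transform in place of thrust::transform. The struct below is my sketch of what a saxpy_functor looks like (the real one in saxpy.cu is the Thrust version of this): the constructor stores A, and operator() later uses the stored value:

    ```cpp
    #include <algorithm>
    #include <iostream>
    #include <vector>

    // Sketch of the saxpy_functor pattern: the constructor binds the
    // scalar a, and operator() computes a*x + y using the bound value.
    struct saxpy_functor {
        const float a;
        explicit saxpy_functor(float a_) : a(a_) {}
        float operator()(float x, float y) const { return a * x + y; }
    };

    int main() {
        std::vector<float> x{1, 2, 3}, y{10, 20, 30}, z(3);

        // saxpy_functor(2.0f) constructs an object with A = 2 bound
        // inside it -- the closure-like, more restricted object.
        std::transform(x.begin(), x.end(), y.begin(), z.begin(),
                       saxpy_functor(2.0f));

        for (float v : z) std::cout << v << " ";  // 12 24 36
        std::cout << "\n";
        return 0;
    }
    ```

    A C++11 lambda with a capture, [a](float x, float y){ return a*x + y; }, does exactly the same binding, which is why the 2010 slides' only dated aspect is spelling closures out as functor classes.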

1.1 Bug

I found and reported a bug in version 100904: it does not work with OpenMP. The report was immediately closed because they already knew about the bug. They just hadn't told us users.

A while ago I reported a bug in nvcc, where it went into an infinite loop for a certain array size. The next minor version of CUDA was released the next day.

Two observations:

  1. I'm good at breaking SW by expecting it to meet its specs.

  2. Nvidia is responsive, which I like.