PAR Lecture 18, Mon Mar 27

W Randolph Franklin, RPI

2017-03-27 00:00

Comments

Table of contents

1 Thrust
- 1.1 Examples
- 1.2 Backends

1 Thrust

1.1 Examples

I rewrote /parallel-class/thrust/examples-1.8/tiled_range.cu into /parallel-class/thrust/rpi/tiled_range2.cu .

It is now much shorter and much clearer. All the work is done here:

gather(make_transform_iterator(make_counting_iterator(0), _1%N), make_transform_iterator(make_counting_iterator(N*C), _1%N), data.begin(), V.begin());
1. make_counting_iterator(0) returns pointers to the sequence 0, 1, 2, ...
2. _1%N is a function computing modulo N.
3. make_transform_iterator(make_counting_iterator(0), _1%N) returns pointers to the sequence 0%N, 1%N, ...
4. gather populates V. The i-th element of V gets make_transform_iterator...+i element of data, i.e., the i%N-th element of data.
tiled_range3.cu is even shorter. Instead of writing an output vector, it constructs an iterator for a virtual output vector:

auto output=make_permutation_iterator(data, make_transform_iterator(make_counting_iterator(0), _1%N));
1. *(output+i) is *(data+(i%N)).
2. You can get as many tiles as you want by iterating.
3. tiled_range3.cu also constructs an iterator for a virtual input vector (in this case a vector of squares) instead of storing the data:
auto data = make_transform_iterator(make_counting_iterator(0), _1*_1);
tiled_range5.cu shows how to use a lambda instead of the _1 notation:

auto output=make_permutation_iterator(data, make_transform_iterator(make_counting_iterator(0), [](const int i){return i%N;} ));
1. You have to compile with --std c++11 .
2. This can be rewritten thus:
  
  auto f = [](const int i){return i%N;}; auto output = make_permutation_iterator(data, make_transform_iterator(make_counting_iterator(0), f));
3. The shortest lambda is this:
  
  auto f = [](){};
repeated_range2.cu is my improvement on repeated_range.cu:

auto output=make_permutation_iterator(data.begin(), make_transform_iterator(make_counting_iterator(0), _1/3));
1. make_transform_iterator(make_counting_iterator(0), _1/3)) returns pointers to the sequence 0,0,0,1,1,1,2,2,2, ...
Unmodified thrust examples:
1. expand.cu takes a vector like V= [0, 10, 20, 30, 40] and a vector of repetition counts, like C= [2, 1, 0, 3, 1]. Expand repeats each element of V the appropriate number of times, giving [0, 0, 10, 30, 30, 30, 40]. The process is as follows.
  1. Since the output vector will be longer than the input, the main program computes the output size, byt reduce summing C, and constructs a vector to hold the output.
  2. Exclusive_scan C to obtain output offsets for each input element: C2 = [0, 2, 3, 3, 6].
  3. Scatter_if the nonzero counts into their corresponding output positions. A counting iterator, [0, 1, 2, 3, 4] is mapped with C2, using C as the stencil, giving C3 = [0, 0, 1, 3, 0, 0, 4].
  4. An inclusive_scan with max fills in the holes in C3, to give C4 = [0, 0, 1, 3, 3, 3, 4].
  5. Gather uses C4 to gather elements of V: [0, 0, 10, 30, 30, 30, 40].
2. set_operations.cu. This shows methods of handling an operation whose output is of unpredictable size. The question is, is space or time more important?
  1. If the maximum possible output size is reasonable, then construct an output vector of that size, use it, and then erase it down to its actual size.
  2. Or, run the operation twice. The 1st time, write to a discard_iterator, and remember only the size of the written data. Then, construct an output vector of exactly the right size, and run the operation again.
    
    I use this technique a lot with ragged arrays in sequential programs.
3. sparse_vector.cu represents and sums sparse vectors.
  1. A sparse vector has mostly 0s.
  2. The representation is a vector of element indices and another vector of values.
  3. Adding two sparse vectors goes as follows.
    1. Allocate temporary index and element vectors of the max possible size (the sum of the sizes of the two inputs).
    2. Catenate the input vectors.
    3. Sort by index.
    4. Find the number of unique indices by applying inner_product with addition and not-equal-to-next-element to the indices, then adding one.
      
      E.g., applied to these indices: [0, 3, 3, 4, 5, 5, 5, 8], it gives 5.
    5. Allocate exactly enough space for the output.
    6. Apply reduce_by_key to the indices and elements to add elements with the same keys.
      
      The size of the output is the number of unique keys.
What's the best way to sort 16000 sets of 1000 numbers each? E.g., sort the rows of a 16000x1000 array? On geoxeon, @@/pc/thrust/rpi/comparesorts.cu@@, which I copied from http://stackoverflow.com/questions/28150098/how-to-use-thrust-to-sort-the-rows-of-a-matrix|stackoverflow, compares three methods.
Call the thrust sort 16000 times, once per set. That took 10 secs.
Sort the whole list of 16,000,000 numbers together. Then sort it again by key, with the keys being the set number, to bring the elements of each set together. Since the sort is stable, this maintains the order within each set. (This is also how radix sort works.) That took 0.04 secs.
Call a thrust function (to sort a set) within another thrust function (that applies to each set). This is new in Thrust 1.8. That took 0.3 secs.

This is a surprising and useful paradigm. It works because

There's an overhead to starting each thrust function, and
Radix sort, which thrust uses for ints, takes linear time.

1.2 Backends

The Thrust device can be CUDA, OpenMP, TBB, etc.
You can spec it in 2 ways:
1. by adding an extra arg at the start of a function's arg list.
2. with an envar
  
  https://github.com/thrust/thrust/wiki/Host-Backends
  
  https://github.com/thrust/thrust/wiki/Device-Backends