PAR Class 15, Mon 2019-03-11

2   OpenMP: CPU time vs wall clock (elapsed) time

On parallel.ecse, /parallel-class/openmp/rpi/cpuvs.wall.cc is a simple program that shows how wallclock time shrinks as the number of threads rises (up to a point) but CPU time rises from the start.

That is because one way that a thread waits to run is to burn cycles. For small delays, this is easier and has less overhead than explicitly suspending. The GNU version of OpenMP lets you choose the crossover point between buring cycles and syspending a thread.

3   Thrust

3.1   Bug

I found and reported a bug in version 100904. This version does not work with OpenMP. It was immediately closed because they already knew about it. They just hadn't told us users.

Awhile ago I reported a bug in nvcc, where it went into an infinite loop for a certain array size. The next minor version of CUDA was released the next day.

Two observations:

  1. I'm good at breaking SW by expecting it to meet its specs.
  2. Nvidia is responsive, which I like.

3.2   More info

  1. /parallel-class/thrust/doc/An_Introduction_To_Thrust.pdf
  2. GTC_2010_Part_2_Thrust_By_Example.pdf

3.3   Examples ctd

The point of these examples is to show you some techniques for parallel programming that are not obvious. I.e., you probably wouldn't think of them if you did not already know them.

  1. repeated_range2.cu is my improvement on repeated_range.cu:

    auto output=make_permutation_iterator(data.begin(), make_transform_iterator(make_counting_iterator(0), _1/3));

    1. make_transform_iterator(make_counting_iterator(0), _1/3)) returns pointers to the sequence 0,0,0,1,1,1,2,2,2, ...
  2. Unmodified thrust examples:

    1. expand.cu takes a vector like V= [0, 10, 20, 30, 40] and a vector of repetition counts, like C= [2, 1, 0, 3, 1]. Expand repeats each element of V the appropriate number of times, giving [0, 0, 10, 30, 30, 30, 40]. The process is as follows.

      1. Since the output vector will be longer than the input, the main program computes the output size, by reduce summing C, and constructs a vector to hold the output.
      2. Exclusive_scan C to obtain output offsets for each input element: C2 = [0, 2, 3, 3, 6].
      3. Scatter_if the nonzero counts into their corresponding output positions. A counting iterator, [0, 1, 2, 3, 4] is mapped with C2, using C as the stencil, giving C3 = [0, 0, 1, 3, 0, 0, 4].
      4. An inclusive_scan with max fills in the holes in C3, to give C4 = [0, 0, 1, 3, 3, 3, 4].
      5. Gather uses C4 to gather elements of V: [0, 0, 10, 30, 30, 30, 40].
    2. set_operations.cu. This shows methods of handling an operation whose output is of unpredictable size. The question is, is space or time more important?

      1. If the maximum possible output size is reasonable, then construct an output vector of that size, use it, and then erase it down to its actual size.

      2. Or, run the operation twice. The 1st time, write to a discard_iterator, and remember only the size of the written data. Then, construct an output vector of exactly the right size, and run the operation again.

        I use this technique a lot with ragged arrays in sequential programs.

    3. sparse_vector.cu represents and sums sparse vectors.

      1. A sparse vector has mostly 0s.

      2. The representation is a vector of element indices and another vector of values.

      3. Adding two sparse vectors goes as follows.

        1. Allocate temporary index and element vectors of the max possible size (the sum of the sizes of the two inputs).

        2. Catenate the input vectors.

        3. Sort by index.

        4. Find the number of unique indices by applying inner_product with addition and not-equal-to-next-element to the indices, then adding one.

          E.g., applied to these indices: [0, 3, 3, 4, 5, 5, 5, 8], it gives 5.

        5. Allocate exactly enough space for the output.

        6. Apply reduce_by_key to the indices and elements to add elements with the same keys.

          The size of the output is the number of unique keys.

  3. What's the best way to sort 16000 sets of 1000 numbers each? E.g., sort the rows of a 16000x1000 array? On geoxeon, /pc/thrust/rpi/comparesorts.cu, which I copied from http://stackoverflow.com/questions/28150098/how-to-use-thrust-to-sort-the-rows-of-a-matrix|stackoverflow, compares three methods.

  4. Call the thrust sort 16000 times, once per set. That took 10 secs.

  5. Sort the whole list of 16,000,000 numbers together. Then sort it again by key, with the keys being the set number, to bring the elements of each set together. Since the sort is stable, this maintains the order within each set. (This is also how radix sort works.) That took 0.04 secs.

  6. Call a thrust function (to sort a set) within another thrust function (that applies to each set). This is new in Thrust 1.8. That took 0.3 secs.

This is a surprising and useful paradigm. It works because

  1. There's an overhead to starting each thrust function, and
  2. Radix sort, which thrust uses for ints, takes linear time.

3.4   Backends

  1. The Thrust device can be CUDA, OpenMP, TBB, etc.

  2. You can spec it in 2 ways:

    1. by adding an extra arg at the start of a function's arg list.

    2. with an envar

      https://github.com/thrust/thrust/wiki/Host-Backends

      https://github.com/thrust/thrust/wiki/Device-Backends