PAR Class 20, Mon 2021-04-05

1 Thrust, ctd

  1. Examples

    1. bucket_sort2d.cu overlays a grid on a set of 2D points and finds the points in each grid cell (bucket).

      1. The tuple is an efficient class for a short vector of fixed length.

      2. Note how random numbers are generated. You combine an engine that produces random output with a distribution.

        However, you may need more careful coding to get independent random streams when executing in parallel. See monte_carlo_disjoint_sequences.cu.

      3. The problem is that the number of points in each cell is unpredictable.

      4. The cell containing each point is computed; then the cell indices and points are sorted together, with the cell index as the key, to bring the points in each cell into contiguous runs.

      5. Then lower_bound and upper_bound are used to find each bucket in that sorted vector of points.

      6. See Thrust's lower_bound documentation. A condensed sketch of the whole pipeline follows below.
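
      Here is a condensed sketch of that pipeline, in the spirit of bucket_sort2d.cu but not the file itself; the grid size, point count, and functor names are illustrative:

        #include <thrust/device_vector.h>
        #include <thrust/transform.h>
        #include <thrust/sort.h>
        #include <thrust/binary_search.h>
        #include <thrust/random.h>
        #include <thrust/tuple.h>
        #include <thrust/iterator/counting_iterator.h>

        typedef thrust::tuple<float,float> vec2;     // tuple as a short fixed-length vector

        struct make_random_point {
          __host__ __device__ vec2 operator()(unsigned int i) const {
            thrust::default_random_engine rng;                       // engine: raw random output
            thrust::uniform_real_distribution<float> u(0.0f, 1.0f);  // distribution: shapes it
            rng.discard(2 * i);                      // give each index its own subsequence
            float x = u(rng), y = u(rng);
            return thrust::make_tuple(x, y);
          }
        };

        struct point_to_cell {                       // which grid cell (bucket) holds this point?
          int w, h;
          point_to_cell(int w, int h) : w(w), h(h) {}
          __host__ __device__ unsigned int operator()(vec2 p) const {
            unsigned int cx = thrust::get<0>(p) * w;
            unsigned int cy = thrust::get<1>(p) * h;
            return cy * w + cx;                      // row-major cell index
          }
        };

        int main() {
          const int n = 1 << 20, w = 4, h = 4;
          thrust::device_vector<vec2> points(n);
          thrust::transform(thrust::counting_iterator<unsigned int>(0),
                            thrust::counting_iterator<unsigned int>(n),
                            points.begin(), make_random_point());

          // compute each point's cell, then sort the points by cell
          thrust::device_vector<unsigned int> cells(n);
          thrust::transform(points.begin(), points.end(), cells.begin(),
                            point_to_cell(w, h));
          thrust::sort_by_key(cells.begin(), cells.end(), points.begin());

          // bucket i is the range [lower[i], upper[i]) of the sorted points
          thrust::device_vector<unsigned int> lower(w * h), upper(w * h);
          thrust::lower_bound(cells.begin(), cells.end(),
                              thrust::counting_iterator<unsigned int>(0),
                              thrust::counting_iterator<unsigned int>(w * h),
                              lower.begin());
          thrust::upper_bound(cells.begin(), cells.end(),
                              thrust::counting_iterator<unsigned int>(0),
                              thrust::counting_iterator<unsigned int>(w * h),
                              upper.begin());
          return 0;
        }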

    2. mode.cu shows:

      1. Counting the number of unique keys in a vector.

        1. Sort the vector.

        2. Do an inner_product. However, instead of the operators being times and plus, they are not-equal-to (comparing each element with its successor) and plus; starting the sum at 1, the result is the number of unique keys.

      2. Counting their multiplicity.

        1. Construct vectors, sized at the number of unique keys, to hold the unique keys and counts.

        2. Do a reduce_by_key on a constant_iterator of 1s, using the sorted vector as the keys. For each run of identical keys, it sums the constant 1s; that is, it counts the identical keys in that run.

        3. Write a vector of unique keys and a vector of the counts.

      3. Finding the most used key (the mode).

        1. Do max_element on the counts vector.
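
      Here is a condensed sketch of those three steps, with illustrative data; it paraphrases mode.cu rather than reproducing it:

        #include <thrust/device_vector.h>
        #include <thrust/sort.h>
        #include <thrust/inner_product.h>
        #include <thrust/reduce.h>
        #include <thrust/extrema.h>
        #include <thrust/functional.h>
        #include <thrust/iterator/constant_iterator.h>
        #include <iostream>

        int main() {
          int init[] = {3, 1, 3, 2, 3, 1};
          thrust::device_vector<int> data(init, init + 6);

          thrust::sort(data.begin(), data.end());    // 1 1 2 3 3 3

          // adjacent unequal pairs, plus 1, = number of unique keys
          int num_unique =
              thrust::inner_product(data.begin(), data.end() - 1,
                                    data.begin() + 1,
                                    1,                              // initial sum
                                    thrust::plus<int>(),            // the "plus"
                                    thrust::not_equal_to<int>());   // the "times"

          // sum a constant 1 over each run of equal keys, i.e. count the run
          thrust::device_vector<int> keys(num_unique), counts(num_unique);
          thrust::reduce_by_key(data.begin(), data.end(),
                                thrust::constant_iterator<int>(1),
                                keys.begin(), counts.begin());

          // the mode is the key with the largest count
          thrust::device_vector<int>::iterator it =
              thrust::max_element(counts.begin(), counts.end());
          int mode = keys[it - counts.begin()];
          std::cout << "mode = " << mode << "\n";    // prints 3
          return 0;
        }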

    3. repeated_range.cu repeats each element of an N-vector K times: repeated_range([0, 1, 2, 3], 2) -> [0, 0, 1, 1, 2, 2, 3, 3]. It's a lightweight version of expand.cu, but uses a different technique.

      1. Here, N=4 and K=2.

      2. The idea is to construct a new iterator, repeated_range, that, when read and incremented, will return the proper output elements.

      3. The constructor stores the relevant info (the input range and K) in the iterator object's member variables.

      4. Treating the iterator's position as an index i in the range [0, N*K), it returns element i/K (integer division) of its input.

      See also strided_range.cu and tiled_range.cu.
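
      Here is a condensed sketch of the same idea built from Thrust's fancy iterators; the course file wraps this machinery in a reusable repeated_range class, and my functor name is illustrative:

        #include <thrust/device_vector.h>
        #include <thrust/copy.h>
        #include <thrust/iterator/counting_iterator.h>
        #include <thrust/iterator/transform_iterator.h>
        #include <thrust/iterator/permutation_iterator.h>
        #include <iostream>
        #include <iterator>

        struct div_by_k {                  // maps output index i to input index i/K
          int k;
          div_by_k(int k) : k(k) {}
          __host__ __device__ int operator()(int i) const { return i / k; }
        };

        int main() {
          const int N = 4, K = 2;
          int init[] = {0, 1, 2, 3};
          thrust::device_vector<int> in(init, init + N);
          thrust::device_vector<int> out(N * K);

          // reading position i of the permutation_iterator returns in[i/K]
          thrust::copy_n(
              thrust::make_permutation_iterator(
                  in.begin(),
                  thrust::make_transform_iterator(
                      thrust::counting_iterator<int>(0), div_by_k(K))),
              N * K, out.begin());

          thrust::copy(out.begin(), out.end(),
                       std::ostream_iterator<int>(std::cout, " "));  // 0 0 1 1 2 2 3 3
          return 0;
        }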

1.1 Universal vector; Backends

  1. The Thrust device can be CUDA, OpenMP, TBB, etc.

  2. You can specify it in two ways:

    1. by adding an execution policy (e.g., thrust::host or thrust::device) as an extra argument at the start of a function's argument list.

    2. with a compile-time macro; see:

      https://github.com/thrust/thrust/wiki/Host-Backends

      https://github.com/thrust/thrust/wiki/Device-Backends

  3. universal_vector uses CUDA unified memory, so there's no need to distinguish host storage from device storage. E.g.,

    /parclass/2021/files/thrust/rpi/tiled_range2u.cu vs tiled_range2.cu

  4. This might cost up to a 2x performance penalty; I haven't measured it.

  5. You can force a function to execute on the host, or on the device, by giving thrust::host or thrust::device as an extra first argument.

  6. There are compiler switches to define what the host or device should be. Common choices are single-threaded host, host OpenMP, host TBB, and CUDA GPU. You could conceivably add your own.

    The theory is that no source code changes are required.
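
    A minimal sketch combining both points, assuming a recent Thrust that provides universal_vector:

      #include <thrust/universal_vector.h>
      #include <thrust/execution_policy.h>
      #include <thrust/sort.h>
      #include <iostream>

      int main() {
        thrust::universal_vector<int> v(1000);            // unified memory: one vector for both sides
        for (int i = 0; i < 1000; ++i) v[i] = 1000 - i;   // ordinary host-side writes

        thrust::sort(thrust::device, v.begin(), v.end()); // policy argument forces the device

        std::cout << v[0] << ' ' << v[999] << '\n';       // host reads; no explicit copy needed
        return 0;
      }

    Rebuilding the same source with, e.g., -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP retargets thrust::device at OpenMP, which is the sense in which no source changes are required.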

2 Linux HMM (Heterogeneous Memory Management)

This is hardware support to make programming devices like GPUs easier. It took years to get into the official kernel, presumably because you don't want bugs in memory management.

This is also a nice intro into current operating systems concerns. Do our operating systems courses talk about this?

  1. https://www.kernel.org/doc/html/v4.18/vm/hmm.html

  2. http://on-demand.gputechconf.com/gtc/2017/presentation/s7764_john-hubbardgpus-using-hmm-blur-the-lines-between-cpu-and-gpu.pdf

3 Several parallel Nvidia tools

  1. OpenMP, OpenACC, Thrust, CUDA, C++17, ...

  2. https://developer.nvidia.com/blog/accelerating-standard-c-with-gpus-using-stdpar/

  3. https://developer.nvidia.com/hpc-compilers

  4. There are other interesting pages.

  5. To first order, they have similar performance, which is dominated by data transfer; they all handle that the same way.

  6. Parallel STL (the C++ Standard Template Library's parallel algorithms) incorporated a lot of Thrust.

  7. C++17 may have the greatest potential, since its parallel execution policies are part of the standard algorithms, so ordinary C++ code can use them unchanged.
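
  A minimal C++17 stdpar sketch (my example, not from the pages above): the standard algorithms take an execution policy, and NVIDIA's nvc++ -stdpar can offload them to the GPU:

    // build: nvc++ -stdpar=gpu sort.cpp   (or g++ -std=c++17 sort.cpp -ltbb)
    #include <algorithm>
    #include <execution>
    #include <vector>

    int main() {
      std::vector<double> v(1 << 20);
      for (std::size_t i = 0; i < v.size(); ++i) v[i] = v.size() - i;

      // same call as a serial sort, plus a policy; nvc++ maps it to the GPU
      std::sort(std::execution::par, v.begin(), v.end());
      return 0;
    }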

4 Nvidia GTC

  1. It's free this week: https://www.nvidia.com/en-us/gtc/

  2. Jensen Huang's keynote should be good.

  3. Browse around.

5 My parallel research

  1. I use Thrust (and OpenMP) for some geometry (CAD or GIS) research: efficient algorithms on large datasets.

  2. One example is to process a set of 3D points to find all pairs closer than a given distance.

  3. Another is to overlay two polyhedra or triangulations to find all intersecting pieces. That uses big rational arithmetic to avoid roundoff errors, and Simulation of Simplicity to handle degenerate cases.