
PAR Class 21, Mon 2021-04-12

1 Final projects

  1. See the syllabus.

  2. April 19: team and title, and 1-2 minute proposal presentation to class.

  3. April 26: 100-word project report.

  4. April 26, 29, or May 3: 10-15 minute presentations. Email me with preferred dates.

  5. May 5: report due.

2 Thrust, conclusion

  1. thrust::universal_vector uses unified memory, so one vector is visible to both the host and the device; there is no need for separate host and device vectors. E.g.,

    /parclass/2021/files/thrust/rpi/tiled_range2u.cu vs tiled_range2.cu

  2. Unified memory might carry up to a 2x performance penalty; I haven't measured it.

  3. You can force an algorithm to execute on the host, or on the device, by passing an execution policy (thrust::host or thrust::device) as an extra 1st argument.

  4. There are compiler switches to define what the host or device backend should be. Common backends are host single-threaded C++, host OpenMP, host TBB, and the GPU (CUDA). You could conceivably add your own.

    The theory is that no source code changes are required.
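Points 1 and 3 above can be sketched as follows. This is my own minimal example, not the course's tiled_range2u.cu; it assumes the CUDA toolkit (compile with nvcc) and a GPU:

```cuda
#include <thrust/universal_vector.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>

// Callable on host or device; Thrust picks based on the execution policy.
struct square {
    __host__ __device__ float operator()(float x) const { return x * x; }
};

int main() {
    // One vector in unified memory; no host_vector/device_vector pair,
    // no explicit cudaMemcpy.
    thrust::universal_vector<float> v(10);
    thrust::sequence(v.begin(), v.end());          // v = 0, 1, ..., 9

    // The extra 1st argument (execution policy) forces where this runs:
    // thrust::device for the GPU, thrust::host for the CPU.
    thrust::transform(thrust::device, v.begin(), v.end(), v.begin(), square());

    // The host can read the result directly.
    return v[9] == 81.0f ? 0 : 1;
}
```

The same source illustrates point 4: compiling with, e.g., -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP retargets the "device" to OpenMP on the host with no source changes.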

3 Several parallel Nvidia tools

  1. OpenMP, OpenACC, Thrust, CUDA, C++17, ...

  2. https://developer.nvidia.com/blog/accelerating-standard-c-with-gpus-using-stdpar/

  3. https://developer.nvidia.com/hpc-compilers

  4. Nvidia's developer site has other interesting pages.

  5. To first order, they have similar performance: runtime is dominated by host-device data transfer, which they all handle the same way.

  6. C++17 parallel algorithms may have the greatest potential, but I don't know whether the implementations are mature enough yet.
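The stdpar idea from the first link can be sketched like this. This is my own example, assuming Nvidia's nvc++ from the HPC SDK: nvc++ -stdpar=gpu offloads the standard parallel algorithms to the GPU, while g++ (linked with TBB) runs the identical source on host threads:

```cpp
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> v(1 << 20);
    std::iota(v.begin(), v.end(), 0.0);

    // Plain standard C++17: no vendor extensions in the source.
    // Where this runs is decided entirely by compiler switches.
    std::transform(std::execution::par_unseq, v.begin(), v.end(), v.begin(),
                   [](double x) { return 2.0 * x; });
    double s = std::reduce(std::execution::par_unseq, v.begin(), v.end(), 0.0);
    return s > 0.0 ? 0 : 1;
}
```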

4 Nvidia GTC

  1. This week, free. https://www.nvidia.com/en-us/gtc/

  2. Jensen Huang's keynote should be good.

  3. Browse around.

5 My parallel research

  1. I use Thrust (and OpenMP) for some computational geometry (CAD and GIS) research, running efficient algorithms on large datasets.

  2. One example is to process a set of 3D points to find all pairs closer than a given distance.

  3. Another is to overlay two polyhedra or triangulations to find all intersecting pieces. That uses big rational arithmetic to avoid roundoff errors and Simulation of Simplicity to handle degenerate cases.