PAR Class 14, Thu 2019-02-28
1 Linux HMM (Heterogeneous Memory Management)
- https://www.kernel.org/doc/html/v4.18/vm/hmm.html
- http://on-demand.gputechconf.com/gtc/2017/presentation/s7764_john-hubbardgpus-using-hmm-blur-the-lines-between-cpu-and-gpu.pdf
2 Several forms of C++ functions
-
Traditional top level function
auto add(int a, int b) { return a+b;}
You can pass this to another function; what is really passed is a pointer to the function. The compiler generally does not optimize (e.g., inline) across a call made through that pointer.
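A minimal sketch of passing a top-level function to another function. The helper name apply_twice is hypothetical, chosen only for illustration; inside it, f is a pointer to whatever function was passed.

```cpp
// Traditional top-level function, written with an explicit return type.
int add(int a, int b) { return a + b; }

// Hypothetical helper: takes a pointer to any function of type int(int, int).
int apply_twice(int (*f)(int, int), int x, int y) {
    // Both calls go through the pointer f.
    return f(f(x, y), y);
}
```

For example, apply_twice(add, 1, 2) computes add(add(1, 2), 2).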
-
Overload operator() in a new class
Each object (instance) of the class is a different function. operator() can use the object's data members, so the object is a closure.
The function object is local to the block that declares it.
This form optimizes well, since the compiler sees the body of operator() and can inline it.
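A small sketch of such a class. The name AddConstant and its member c are made up for this example; the point is that two objects of the same class behave as two different functions.

```cpp
// Function object: each instance carries its own state.
struct AddConstant {
    int c;  // the value this particular function adds

    explicit AddConstant(int c_) : c(c_) {}

    // Overloading operator() makes an AddConstant object callable.
    int operator()(int x) const { return x + c; }
};
```

With AddConstant add3(3), add10(10); the call add3(4) adds 3 and add10(4) adds 10.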
-
Lambda, or anonymous function.
auto add = [](int a, int b) { return a+b;};
This is local to the containing block.
This form optimizes well.
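A lambda can also capture variables from the containing block, which gives it the same closure behavior as the operator() class. The sketch below uses std::transform from the standard library rather than Thrust; the function name add_to_all is hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Adds k to every element, using a lambda that captures k by value.
std::vector<int> add_to_all(std::vector<int> v, int k) {
    std::transform(v.begin(), v.end(), v.begin(),
                   [k](int x) { return x + k; });
    return v;
}
```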
-
Placeholder notation.
As an argument in, e.g., transform, you can do this:
transform(..., _1+_2);
This is nice and short.
Because the placeholders work by operator overloading, an expression may use only the operators that have been overloaded.
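To make the overloading idea concrete, here is a toy version of placeholders for int arguments. It is only a sketch of the technique, not how Thrust implements it; the namespace toy and the types Arg1, Arg2, and Plus are invented for this example.

```cpp
namespace toy {

// Toy placeholders: each one selects one of the two arguments.
struct Arg1 { int operator()(int a, int) const { return a; } };
struct Arg2 { int operator()(int, int b) const { return b; } };

// Node representing "l + r"; it is evaluated when called with two arguments.
template <class L, class R>
struct Plus {
    L l;
    R r;
    int operator()(int a, int b) const { return l(a, b) + r(a, b); }
};

// Overloading + is what makes the expression _1 + _2 legal.
inline Plus<Arg1, Arg2> operator+(Arg1 a, Arg2 b) { return {a, b}; }

const Arg1 _1{};
const Arg2 _2{};

}  // namespace toy
```

Here (toy::_1 + toy::_2)(3, 4) evaluates 3 + 4. Writing toy::_1 * toy::_2 would not compile, because this toy version overloads only operator+.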
3 Thrust
- Continue Stanford's parallel course notes.
- Lecture 8-: Thrust ctd.
3.1 Examples
-
I rewrote /parallel-class/thrust/examples-1.8/tiled_range.cu into /parallel-class/thrust/rpi/tiled_range2.cu .
It is now much shorter and much clearer. All the work is done here:
gather(make_transform_iterator(make_counting_iterator(0), _1%N), make_transform_iterator(make_counting_iterator(N*C), _1%N), data.begin(), V.begin());
- make_counting_iterator(0) returns an iterator over the sequence 0, 1, 2, ... (an iterator object, not a pointer into stored data).
- _1%N is a function computing its argument modulo N.
- make_transform_iterator(make_counting_iterator(0), _1%N) returns an iterator over the sequence 0%N, 1%N, ...
- gather populates V. The i-th element of V gets the element of data whose index is *(make_transform_iterator...+i), i.e., the (i%N)-th element of data.
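The effect of that gather call can be sketched as an ordinary loop in standard C++ (no Thrust). The function name tile is hypothetical; N is taken from the size of data and C is the number of tiles.

```cpp
#include <cstddef>
#include <vector>

// Plain C++ stand-in for the gather call: V[i] = data[i % N] for i in [0, N*C).
std::vector<int> tile(const std::vector<int>& data, std::size_t C) {
    const std::size_t N = data.size();
    std::vector<int> V(N * C);
    for (std::size_t i = 0; i < V.size(); ++i)
        V[i] = data[i % N];  // the map value for position i is i % N
    return V;
}
```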
-
tiled_range3.cu is even shorter. Instead of writing an output vector, it constructs an iterator for a virtual output vector:
auto output=make_permutation_iterator(data, make_transform_iterator(make_counting_iterator(0), _1%N));
- *(output+i) is *(data+(i%N)).
- You can get as many tiles as you want by iterating.
- tiled_range3.cu also constructs an iterator for a virtual input vector (in this case a vector of squares) instead of storing the data:
auto data = make_transform_iterator(make_counting_iterator(0), _1*_1);
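The idea of virtual vectors can be illustrated without Thrust: nothing is stored, and each element is computed when it is asked for. The function names square_at and tiled_at are hypothetical stand-ins for dereferencing data+i and output+i.

```cpp
// Virtual input: element i of the "vector of squares", computed on demand.
inline int square_at(int i) { return i * i; }

// Virtual output: element i of the tiled sequence, i.e. data[i % N].
inline int tiled_at(int i, int N) { return square_at(i % N); }
```

With N = 3 the virtual output is 0, 1, 4, 0, 1, 4, ... for as long as you keep asking.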
-
tiled_range5.cu shows how to use a lambda instead of the _1 notation:
auto output=make_permutation_iterator(data, make_transform_iterator(make_counting_iterator(0), [](const int i){return i%N;} ));
-
You have to compile with --std c++11 .
-
This can be rewritten thus:
auto f = [](const int i){return i%N;};
auto output = make_permutation_iterator(data, make_transform_iterator(make_counting_iterator(0), f));
-
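The shortest lambda, which takes no arguments and does nothing, is this:
auto f = []{};
The empty parameter list may also be written explicitly: auto f = [](){};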
-
repeated_range2.cu is my improvement on repeated_range.cu:
auto output=make_permutation_iterator(data.begin(), make_transform_iterator(make_counting_iterator(0), _1/3));
- make_transform_iterator(make_counting_iterator(0), _1/3) returns an iterator over the sequence 0,0,0,1,1,1,2,2,2, ...
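The index mapping behind the repeated range can be sketched as a plain loop in standard C++ (no Thrust). The function name repeat_each and the parameter R are hypothetical; R plays the role of the 3 in _1/3.

```cpp
#include <cstddef>
#include <vector>

// Plain C++ stand-in: out[i] = data[i / R], so each element appears R times in a row.
std::vector<int> repeat_each(const std::vector<int>& data, std::size_t R) {
    std::vector<int> out(data.size() * R);
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = data[i / R];  // integer division maps 0,1,2 -> 0; 3,4,5 -> 1; ...
    return out;
}
```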