PAR Class 10, Wed 2018-03-28

1   /parallel-class

This dir on parallel.ecse has some of my modified programs that are not on geoxeon.ecse, such as tiled_range2.cu (mentioned below).

2   OpenMP: CPU time vs wall clock (elapsed) time

On parallel.ecse, /parallel-class/openmp/rpi/cpuvs.wall.cc is a simple program showing that, as the number of threads rises, wall clock (elapsed) time shrinks (up to a point) while total CPU time rises from the start.

That is because one way for a thread to wait is to burn cycles (busy-wait). For short waits, this is simpler and has less overhead than explicitly suspending the thread. The GNU implementation of OpenMP lets you choose the crossover point between burning cycles and suspending a thread, e.g., with the OMP_WAIT_POLICY and GOMP_SPINCOUNT environment variables.
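Here is a minimal sketch of that experiment (my reconstruction, not the actual cpuvs.wall.cc): omp_get_wtime() measures wall clock time, while clock() on Linux accumulates CPU time over all threads. Compile with g++ -fopenmp and rerun with different values of OMP_NUM_THREADS.

    #include <omp.h>
    #include <ctime>
    #include <cstdio>

    int main() {
      const double w0 = omp_get_wtime();   // wall clock start
      const clock_t c0 = clock();          // CPU time start (summed over all threads)
      double sum = 0;
      #pragma omp parallel for reduction(+:sum)
      for (long i = 0; i < 300000000L; i++) sum += i % 7;
      printf("wall %.2f s, cpu %.2f s (sum=%g)\n",
             omp_get_wtime() - w0,
             double(clock() - c0) / CLOCKS_PER_SEC, sum);
      return 0;
    }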

3   More Thrust info

  1. /parallel-class/thrust/doc/An_Introduction_To_Thrust.pdf
  2. GTC_2010_Part_2_Thrust_By_Example.pdf

4   Linux programming tip

I like to use emacs on my local machine to edit files on parallel.ecse. It's more responsive than running a remote editor, and lets me copy to and from local files. Here are 2 ways to do that.

  1. In emacs on my local machine, I do find-file thus:

    /parallel.ecse.rpi.edu:/parallel-class/thrust/rpi/tiled_range2.cu

    Emacs transparently runs scp to read and write the remote file.

  2. I can mount remote directories on my local machine with sshfs and then edit the remote files as if they were local.

In either case, I have to compile the files on parallel.

5   Thrust

5.1   Examples

  1. I rewrote /parallel-class/thrust/examples-1.8/tiled_range.cu into /parallel-class/thrust/rpi/tiled_range2.cu.

    It is now much shorter and much clearer. All the work is done here:

    gather(make_transform_iterator(make_counting_iterator(0), _1%N),
           make_transform_iterator(make_counting_iterator(N*C), _1%N),
           data.begin(), V.begin());

    1. make_counting_iterator(0) returns pointers to the sequence 0, 1, 2, ...
    2. _1%N is a function computing modulo N.
    3. make_transform_iterator(make_counting_iterator(0), _1%N) returns pointers to the sequence 0%N, 1%N, ...
    4. gather populates V: the i-th element of V gets the element of data selected by the i-th value of the map iterator, i.e., the i%N-th element of data.
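    Here is a self-contained sketch of that idiom (my minimal reconstruction, not the actual tiled_range2.cu), tiling a 3-element vector twice:

      #include <thrust/device_vector.h>
      #include <thrust/gather.h>
      #include <thrust/iterator/counting_iterator.h>
      #include <thrust/iterator/transform_iterator.h>
      #include <thrust/functional.h>
      #include <iostream>
      using namespace thrust;
      using namespace thrust::placeholders;

      int main() {
        const int N = 3, C = 2;          // tile length and number of tiles
        device_vector<int> data(N);
        data[0] = 10; data[1] = 20; data[2] = 30;
        device_vector<int> V(N*C);
        // map output position i to input position i%N, then gather
        gather(make_transform_iterator(make_counting_iterator(0), _1%N),
               make_transform_iterator(make_counting_iterator(N*C), _1%N),
               data.begin(), V.begin());
        for (int i = 0; i < N*C; i++) std::cout << V[i] << ' ';  // 10 20 30 10 20 30
        std::cout << '\n';
      }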
  2. tiled_range3.cu is even shorter. Instead of writing an output vector, it constructs an iterator for a virtual output vector:

    auto output=make_permutation_iterator(data, make_transform_iterator(make_counting_iterator(0), _1%N));

    1. *(output+i) is *(data+(i%N)).
    2. You can get as many tiles as you want by iterating.
    3. tiled_range3.cu also constructs an iterator for a virtual input vector (in this case a vector of squares) instead of storing the data:

    auto data = make_transform_iterator(make_counting_iterator(0), _1*_1);
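    Combining the two virtual vectors gives a complete program that stores nothing at all (my sketch, assuming the iterators are dereferenced on the host):

      #include <thrust/iterator/counting_iterator.h>
      #include <thrust/iterator/transform_iterator.h>
      #include <thrust/iterator/permutation_iterator.h>
      #include <thrust/functional.h>
      #include <iostream>
      using namespace thrust;
      using namespace thrust::placeholders;

      int main() {
        const int N = 4;
        // virtual input: data[i] == i*i
        auto data = make_transform_iterator(make_counting_iterator(0), _1*_1);
        // virtual output: output[i] == data[i%N]
        auto output = make_permutation_iterator(data,
            make_transform_iterator(make_counting_iterator(0), _1%N));
        for (int i = 0; i < 2*N; i++) std::cout << output[i] << ' ';  // 0 1 4 9 0 1 4 9
        std::cout << '\n';
      }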

  3. tiled_range5.cu shows how to use a lambda instead of the _1 notation:

    auto output=make_permutation_iterator(data, make_transform_iterator(make_counting_iterator(0), [](const int i){return i%N;} ));

    1. You have to compile with --std=c++11.

    2. This can be rewritten thus:

      auto f = [](const int i){return i%N;};
      auto output = make_permutation_iterator(data,
          make_transform_iterator(make_counting_iterator(0), f));

    3. The shortest lambda is this:

      auto f = [](){};

  4. repeated_range2.cu is my improvement on repeated_range.cu:

    auto output=make_permutation_iterator(data.begin(), make_transform_iterator(make_counting_iterator(0), _1/3));

    1. make_transform_iterator(make_counting_iterator(0), _1/3) returns pointers to the sequence 0,0,0,1,1,1,2,2,2, ...
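    A minimal sketch of the same idea (my reconstruction, not the actual repeated_range2.cu):

      #include <thrust/device_vector.h>
      #include <thrust/iterator/counting_iterator.h>
      #include <thrust/iterator/transform_iterator.h>
      #include <thrust/iterator/permutation_iterator.h>
      #include <thrust/functional.h>
      #include <iostream>
      using namespace thrust;
      using namespace thrust::placeholders;

      int main() {
        device_vector<int> data(3);
        data[0] = 10; data[1] = 20; data[2] = 30;
        // output[i] == data[i/3]: each element repeated 3 times
        auto output = make_permutation_iterator(data.begin(),
            make_transform_iterator(make_counting_iterator(0), _1/3));
        for (int i = 0; i < 9; i++) std::cout << output[i] << ' ';  // 10 10 10 20 20 20 30 30 30
        std::cout << '\n';
      }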
  5. Unmodified thrust examples:

    1. expand.cu takes a vector like V= [0, 10, 20, 30, 40] and a vector of repetition counts, like C= [2, 1, 0, 3, 1]. Expand repeats each element of V the appropriate number of times, giving [0, 0, 10, 30, 30, 30, 40]. The process is as follows.

      1. Since the output vector will be longer than the input, the main program computes the output size by summing C with reduce, and constructs a vector to hold the output.
      2. Exclusive_scan C to obtain output offsets for each input element: C2 = [0, 2, 3, 3, 6].
      3. Scatter_if the nonzero counts into their corresponding output positions. A counting iterator, [0, 1, 2, 3, 4] is mapped with C2, using C as the stencil, giving C3 = [0, 0, 1, 3, 0, 0, 4].
      4. An inclusive_scan with max fills in the holes in C3, to give C4 = [0, 0, 1, 3, 3, 3, 4].
      5. Gather uses C4 to gather elements of V: [0, 0, 10, 30, 30, 30, 40].
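      Those five steps translate almost line for line into Thrust (my condensed sketch of expand.cu's algorithm, not the file itself):

        #include <thrust/device_vector.h>
        #include <thrust/reduce.h>
        #include <thrust/scan.h>
        #include <thrust/scatter.h>
        #include <thrust/gather.h>
        #include <thrust/functional.h>
        #include <thrust/iterator/counting_iterator.h>
        #include <iostream>
        using namespace thrust;

        int main() {
          const int n = 5;
          int v[n] = {0, 10, 20, 30, 40}, c[n] = {2, 1, 0, 3, 1};
          device_vector<int> V(v, v+n), C(c, c+n);
          int outsize = reduce(C.begin(), C.end());               // 7
          device_vector<int> C2(n), C3(outsize, 0), out(outsize);
          exclusive_scan(C.begin(), C.end(), C2.begin());         // [0,2,3,3,6]
          // scatter index i to position C2[i] where C[i] is nonzero
          scatter_if(make_counting_iterator(0), make_counting_iterator(n),
                     C2.begin(), C.begin(), C3.begin());          // [0,0,1,3,0,0,4]
          inclusive_scan(C3.begin(), C3.end(), C3.begin(),
                         maximum<int>());                         // [0,0,1,3,3,3,4]
          gather(C3.begin(), C3.end(), V.begin(), out.begin());
          for (int i = 0; i < outsize; i++) std::cout << out[i] << ' ';  // 0 0 10 30 30 30 40
          std::cout << '\n';
        }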
    2. set_operations.cu shows methods of handling an operation whose output size is unpredictable. The question is whether space or time is more important.

      1. If the maximum possible output size is reasonable, then construct an output vector of that size, use it, and then erase it down to its actual size.

      2. Or, run the operation twice. The 1st time, write to a discard_iterator, and remember only the size of the written data. Then, construct an output vector of exactly the right size, and run the operation again.

        I use this technique a lot with ragged arrays in sequential programs.
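      Here is a sketch of the second technique (my assumption of the idiom, using set_intersection as the operation): the distance between two discard_iterators gives the output size without storing anything.

        #include <thrust/device_vector.h>
        #include <thrust/set_operations.h>
        #include <thrust/iterator/discard_iterator.h>
        #include <iostream>
        using namespace thrust;

        int main() {
          int a[] = {1, 3, 5, 7}, b[] = {3, 4, 5};
          device_vector<int> A(a, a+4), B(b, b+3);
          // pass 1: write to a discard_iterator just to learn the output size
          int n = set_intersection(A.begin(), A.end(), B.begin(), B.end(),
                                   make_discard_iterator()) - make_discard_iterator();
          // pass 2: allocate exactly n elements and run the operation again
          device_vector<int> out(n);
          set_intersection(A.begin(), A.end(), B.begin(), B.end(), out.begin());
          for (int i = 0; i < n; i++) std::cout << out[i] << ' ';  // 3 5
          std::cout << '\n';
        }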

    3. sparse_vector.cu represents and sums sparse vectors.

      1. A sparse vector has mostly 0s.

      2. The representation is a vector of element indices and another vector of values.

      3. Adding two sparse vectors goes as follows.

        1. Allocate temporary index and element vectors of the max possible size (the sum of the sizes of the two inputs).

        2. Catenate the input vectors.

        3. Sort by index.

        4. Find the number of unique indices by applying inner_product with addition and not-equal-to-next-element to the indices, then adding one.

          E.g., applied to these indices: [0, 3, 3, 4, 5, 5, 5, 8], it gives 5.

        5. Allocate exactly enough space for the output.

        6. Apply reduce_by_key to the indices and elements to add elements with the same keys.

          The size of the output is the number of unique keys.
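      Condensed into code, the whole addition looks roughly like this (my sketch of sparse_vector.cu's method, not its actual source):

        #include <thrust/device_vector.h>
        #include <thrust/copy.h>
        #include <thrust/sort.h>
        #include <thrust/inner_product.h>
        #include <thrust/reduce.h>
        #include <thrust/functional.h>
        #include <iostream>
        using namespace thrust;

        int main() {
          // A: 10 at index 0, 20 at index 3;  B: 5 at index 3, 7 at index 8
          int ai[] = {0, 3}, av[] = {10, 20}, bi[] = {3, 8}, bv[] = {5, 7};
          device_vector<int> I(4), V(4);
          copy(ai, ai+2, I.begin()); copy(bi, bi+2, I.begin()+2);  // catenate indices
          copy(av, av+2, V.begin()); copy(bv, bv+2, V.begin()+2);  // catenate values
          sort_by_key(I.begin(), I.end(), V.begin());              // sort by index
          // unique indices = 1 + number of adjacent unequal pairs
          int n = inner_product(I.begin(), I.end()-1, I.begin()+1, 1,
                                plus<int>(), not_equal_to<int>());
          device_vector<int> oi(n), ov(n);                         // exactly enough space
          reduce_by_key(I.begin(), I.end(), V.begin(), oi.begin(), ov.begin());
          for (int i = 0; i < n; i++) std::cout << oi[i] << ':' << ov[i] << ' ';  // 0:10 3:25 8:7
          std::cout << '\n';
        }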

  6. What's the best way to sort 16000 sets of 1000 numbers each, e.g., to sort the rows of a 16000x1000 array? On geoxeon, /pc/thrust/rpi/comparesorts.cu, which I copied from stackoverflow (http://stackoverflow.com/questions/28150098/how-to-use-thrust-to-sort-the-rows-of-a-matrix), compares three methods:

    1. Call the thrust sort 16000 times, once per set. That took 10 secs.

    2. Sort the whole list of 16,000,000 numbers together, carrying each element's set number along as a value. Then stable sort by key, with the keys being the set numbers, to bring the elements of each set back together. Since that second sort is stable, it preserves the order within each set. (This is also how radix sort works.) That took 0.04 secs. (See the sketch below.)

    3. Call a thrust function (to sort a set) within another thrust function (that applies to each set). This is new in Thrust 1.8. That took 0.3 secs.

This is a surprising and useful paradigm. It works because

  1. There's an overhead to starting each thrust function, and
  2. Radix sort, which thrust uses for ints, takes linear time.
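In code, method 2 looks roughly like this (my sketch, assuming the sets are stored back to back in one flat array; not the actual comparesorts.cu):

    #include <thrust/device_vector.h>
    #include <thrust/copy.h>
    #include <thrust/sort.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/iterator/transform_iterator.h>
    #include <thrust/functional.h>
    using namespace thrust;
    using namespace thrust::placeholders;

    // Sort each length-n row of the flattened r x n array a.
    void sort_rows(device_vector<int>& a, int r, int n) {
      // row[i] = set (row) number of element i
      device_vector<int> row(r*n);
      copy(make_transform_iterator(make_counting_iterator(0), _1/n),
           make_transform_iterator(make_counting_iterator(r*n), _1/n),
           row.begin());
      // sort all r*n values at once, carrying the row numbers along
      sort_by_key(a.begin(), a.end(), row.begin());
      // stable sort by row number; stability keeps each row internally sorted
      stable_sort_by_key(row.begin(), row.end(), a.begin());
    }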

5.2   Backends

  1. The Thrust device can be CUDA, OpenMP, TBB, etc.

  2. You can specify it in two ways:

    1. by adding an extra arg at the start of a function's arg list.

    2. with a compile-time macro (e.g., -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP); see:

      https://github.com/thrust/thrust/wiki/Host-Backends

      https://github.com/thrust/thrust/wiki/Device-Backends
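    For example, the per-call method looks like this (a sketch; thrust::omp::par needs OpenMP enabled when compiling):

      #include <thrust/host_vector.h>
      #include <thrust/reduce.h>
      #include <thrust/execution_policy.h>
      #include <thrust/system/omp/execution_policy.h>
      #include <iostream>

      int main() {
        thrust::host_vector<int> v(1000, 1);
        // the extra first argument selects the backend for this one call
        int s1 = thrust::reduce(thrust::seq, v.begin(), v.end());       // sequential
        int s2 = thrust::reduce(thrust::omp::par, v.begin(), v.end());  // OpenMP threads
        std::cout << s1 << ' ' << s2 << '\n';  // 1000 1000
      }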

6   Intel compilers on parallel

  1. Intel Parallel Studio XE Cluster 2017 is now installed on parallel, in /opt/intel/ . It is a large package with compilers, debuggers, analyzers, MPI, etc. There is extensive documentation on Intel's web site. Have fun.

    Students and free SW developers can also get free licenses for their own machines. Commercial licenses cost thousands of dollars.

  2. What’s Inside Intel Parallel Studio XE 2017: PDF slides, a webinar, and training material.

  3. https://goparallel.sourceforge.net/want-faster-code-faster-check-parallel-studio-xe/

  4. Before using the compiler, you should initialize some envars thus:

    source /opt/intel/bin/iccvars.sh -arch intel64
    
  5. Then you can compile a C or C++ program thus:

    icc -qopenmp -O3 foo.c -o foo
    icpc -qopenmp -O3 foo.cc -o foo
    
  6. On my simple tests, not using the mic, icpc and g++ produced equally good code.

  7. Compile a C++ program with OpenMP thus:

    icpc -qopenmp -std=c++11    omp_hello.cc   -o omp_hello
    
  8. Test it thus (it is in /parallel-class/mic):

    OMP_NUM_THREADS=4 ./omp_hello
    

    Note how the output from the various threads is mixed up.
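    For reference, such a hello program is essentially this (my sketch, not the file's exact source):

      #include <omp.h>
      #include <cstdio>

      int main() {
        #pragma omp parallel   // each thread executes the body
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
        return 0;
      }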

7   Programming the MIC (ctd)

  1. Dr. Dobb's: Programming the Xeon Phi, by Rob Farber.
  2. Some MIC demos.
  3. To cross compile with icc and icpc, see the MPSS users guide, Section 8.1.4. Use the -mmic flag.
  4. Intel OpenMP* Support Overview.
  5. New Era for OpenMP*: Beyond Traditional Shared Memory Parallel Programming.
  6. Book: Parallel Programming and Optimization with Intel® Xeon Phi™ Coprocessors, 2nd Edition [508 Pages].
  7. See also https://www.cilkplus.org/
  8. Rolling Out the New Intel Xeon Phi Processor at ISC 2016 https://www.youtube.com/watch?v=HDPYymREyV8
  9. Supermicro Showcases Intel Xeon Phi and Nvidia P100 Solutions at ISC 2016 https://www.youtube.com/watch?v=nVWqSjt6hX4
  10. Insider Look: New Intel® Xeon Phi™ processor on the Cray® XC™ Supercomputer https://www.youtube.com/watch?v=lkf3U_5QG_4

8   OpenMP on the mic

  1. OpenMP is now running on the mic (Xeon Phi).

  2. Setup envars thus (assuming you're using bash or zsh):

    source /opt/intel/bin/iccvars.sh -arch intel64
    export SINK_LD_LIBRARY_PATH=/opt/intel/lib/mic
    
  3. To compile on parallel for running on parallel:

    icpc -fopenmp sum_reduc2.cc -o sum_reduc2
    
  4. Run it on parallel thus:

    ./sum_reduc2
    
  5. To cross compile on parallel for running on mic0:

    icpc -mmic -fopenmp sum_reduc2.cc -o sum_reduc2-mic
    
  6. Run it natively on mic0 thus:

    micnativeloadex sum_reduc2-mic
    
  7. It is also possible to have an OpenMP program on parallel.ecse execute parallel threads on mic0:

    #pragma offload target (mic)
      {
      #pragma omp parallel
        {
        ...
        }
      }
    

    See /parallel-class/mic/stackoverflow/hello.c

  8. Compile it thus:

    icc -fopenmp hello.c -o hello
    

    Note that there is no -mmic.

  9. Run it thus:

    ./hello
    
  10. /parallel-class/mic/stackoverflow/hello_omp.c shows a different syntax.

  11. Degrees of parallelism: See slide 12 of https://software.intel.com/sites/default/files/managed/87/c3/using-nested-parallelism-in-openMP-r1.pdf
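
    A tiny nested-parallelism example (my sketch, not from the slides): enabling nesting lets each outer thread spawn its own team.

      #include <omp.h>
      #include <cstdio>

      int main() {
        omp_set_nested(1);                     // allow nested parallel regions
        #pragma omp parallel num_threads(2)
        {
          int outer = omp_get_thread_num();
          #pragma omp parallel num_threads(3)  // each outer thread forks 3 more
          printf("outer %d, inner %d\n", outer, omp_get_thread_num());
        }
        return 0;
      }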