
PAR Class 27, Mon 2021-05-03

2 Final project

2.1 Today's presentations

  1. Ben and Jack

  2. Blaine

  3. Connor

  4. Dan

  5. Yunshi

2.2 Prior work

If your final project is building on, or sharing with, another course or project (say on GitHub), then you must give the details, and say what's new for this course.

2.3 Paper format

Try to use the IEEE conference format. It allows either LaTeX or MS Word. Submit the PDF paper to Gradescope.

3 Inspiration for finishing your term projects

  1. The Underhanded C Contest

    "The goal of the contest is to write code that is as readable, clear, innocent and straightforward as possible, and yet it must fail to perform at its apparent function. To be more specific, it should do something subtly evil. Every year, we will propose a challenge to coders to solve a simple data processing problem, but with covert malicious behavior. Examples include miscounting votes, shaving money from financial transactions, or leaking information to an eavesdropper. The main goal, however, is to write source code that easily passes visual inspection by other programmers."

  2. The International Obfuscated C Code Contest

  3. https://www.awesomestories.com/asset/view/Space-Race-American-Rocket-Failures

    Moral: After early disasters, sometimes you can eventually get things to work.

  4. The 'Wrong' Brothers Aviation's Failures (1920s)

  5. Early U.S. rocket and space launch failures and explosions

4 Software tips

4.1 Git

Git is good for simultaneously keeping several versions. Here's a git intro:

Create a dir for the project:

mkdir PROJECT; cd PROJECT

Initialize:

git init

Create a branch (you can do this several times):

git branch MYBRANCHNAME

Go to a branch:

git checkout MYBRANCHNAME

Do things:

vi, make, ....

Save it:

git add .; git commit -mCOMMENT

Repeat

At times I've used this to modify a program for class while keeping a copy of the original.

4.2 Freeze decisions early: SW design paradigm

One of my rules is to push design decisions to take effect as early in the process execution as possible. Constructing variables at compile time is best, at function call time is 2nd, and on the heap is worst.

  1. If I have to construct variables on the heap, I construct few and large variables, never many small ones.

  2. Often I compile the max dataset size into the program, which permits constructing the arrays at compile time. Recompiling for a larger dataset is quick (unless you're using CUDA).

    Accessing this type of variable uses one less level of pointer than accessing a variable on the heap. I don't know whether this is faster with a good optimizing compiler, but it's probably not slower.

  3. If the data will require a dataset with unpredictably sized components, such as a ragged array, then I may do the following (sketched in code after this list).

    1. Read the data once to accumulate the necessary statistics.

    2. Construct the required ragged array.

    3. Reread the data and populate the array.
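
Here's a minimal sketch of that two-pass approach. The input format (a count, then that many ints, per record) and the file name are hypothetical, just for illustration:

// Two-pass construction of a ragged array: pass 1 gathers sizes,
// pass 2 populates one big block addressed through an offset table.
#include <cstdio>
#include <vector>

int main() {
  FILE *f = fopen("data.txt", "r");          // hypothetical input file
  if (!f) return 1;
  // Pass 1: read the data once to accumulate the sizes.
  std::vector<int> sizes;
  for (int n; fscanf(f, "%d", &n) == 1; ) {
    sizes.push_back(n);
    for (int i = 0, x; i < n; i++) fscanf(f, "%d", &x);   // skip the payload
  }
  // Construct the ragged array: one large block plus offsets, not many small allocations.
  std::vector<size_t> offset(sizes.size() + 1, 0);
  for (size_t r = 0; r < sizes.size(); r++) offset[r + 1] = offset[r] + sizes[r];
  std::vector<int> data(offset.back());
  // Pass 2: reread the data and populate the array.
  rewind(f);
  for (size_t r = 0; r < sizes.size(); r++) {
    int n;
    fscanf(f, "%d", &n);
    for (int i = 0; i < n; i++) fscanf(f, "%d", &data[offset[r] + i]);
  }
  fclose(f);
}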

Update: However, with CUDA, managed variables may have to be on the heap.

5 Exascale computing

https://www.nextplatform.com/2019/12/04/openacc-cozies-up-to-c-c-and-fortran-standards/

https://www.nextplatform.com/2019/01/09/two-thirds-of-the-way-home-with-exascale-programming/

They agree with me that fewer bigger parallel processors are better than many smaller processors connected with MPI. I.e., they like my priorities in this course.

6 CPPCON

CppCon is the annual, week-long face-to-face gathering for the entire C++ community. https://cppcon.org/

https://cppcon2018.sched.com/

CppCon 2014: Herb Sutter "Paying for Lunch: C++ in the ManyCore Age" https://www.youtube.com/watch?v=AfI_0GzLWQ8

CppCon 2016: Combine Lambdas and weak_ptrs to make concurrency easy (4min) https://www.youtube.com/watch?v=fEnnmpdZllQ

A Pragmatic Introduction to Multicore Synchronization by Samy Al Bahra. https://www.youtube.com/watch?v=LX4ugnzwggg

CppCon 2018: Tsung-Wei Huang “Fast Parallel Programming using Modern C++” https://www.youtube.com/watch?v=ho9bqIJkvkc&list=PLHTh1InhhwT7GoW7bjOEe2EjyOTkm6Mgd&index=13

CppCon 2018: Anny Gakhokidze “Workflow hacks for developers” https://www.youtube.com/watch?v=K4XxeB1Duyo&list=PLHTh1InhhwT7GoW7bjOEe2EjyOTkm6Mgd&index=33

CppCon 2018: Bjarne Stroustrup “Concepts: The Future of Generic Programming (the future is here)” https://www.youtube.com/watch?v=HddFGPTAmtU

CppCon 2018: Herb Sutter “Thoughts on a more powerful and simpler C++ (5 of N)” https://www.youtube.com/watch?v=80BZxujhY38

CppCon 2018: Jefferson Amstutz “Compute More in Less Time Using C++ Simd Wrapper Libraries” https://www.youtube.com/watch?v=80BZxujhY38

CppCon 2018: Geoffrey Romer “What do you mean "thread-safe"?” https://www.youtube.com/watch?v=s5PCh_FaMfM

7 Supercomputing conference

https://supercomputing.org/

"The International Conference for High Performance Computing, Networking, Storage, and Analysis"

8 Jack Dongarra

He has visited RPI more than once.

Algorithms for future emerging technologies (2:55:45) https://www.youtube.com/watch?v=TCgHNMezmZ8

Stony Brook talk https://www.youtube.com/watch?v=D1hfrtoVZDo

9 Course survey

I see my role as a curator, selecting the best stuff to present to you. Since I like the topic, and asked for permission to create this course, I pick things I like, on the theory that you might like them too.

So, if you liked the course, then please officially tell RPI by completing the survey and saying what you think. Thanks.

10 After the semester

I'm open to questions and discussions about any legal, ethical topic. Even after you graduate.

PAR Class 26, Thurs 2021-04-29

PAR Class 25, Mon 2021-04-26

PAR Class 24, Thurs 2021-04-22

1 Deliverables

1.1 Homework submission

You've been doing homeworks all semester, although I haven't given you a way to submit them. They've now been added to Gradescope. Please upload them now.

1.2 Project presentations dates

Please email me with your preferred final project presentation dates.

1.3 Progress report

Next Mon, a project progress report is due; upload it to gradescope.

2 Quantum computing ctd

2.1 Shor's algorithm

  1. Shor on, what is Shor's factoring algorithm? (2:09)

  2. Shor's Factoring Algorithm by Umesh Vazirani (25:42)

    This is an excellent video, which there's no time to show.

    This video is difficult because the topic is difficult, but important. At least you'll get a sense of it. This is what potentially breaks a lot of cryptosystems.

  3. 7. Shor's Algorithm I: Understanding Quantum Fourier Transform, Quantum Phase Estimation - Part 1 by Abraham Asfaw, IBM. (44:05).

    This is from IBM's Introduction to Quantum Computing and Quantum Hardware. Lecture Notes and Labs: https://qiskit.org/learn/intro-qc-qh

    This is the 1st of 5 videos describing the algorithm. There's no time for the rest, but this will give you an idea.

    RPI's quantum group, including me, visited Asfaw last spring just before everything shut down.

3 My Quantum Summary Page

continued from Quantum Properties, plus

  1. What is an Ion Trap Quantum Computer? Simon Benjamin (2:30).

  2. What is Quantum Annealing?, D-Wave Systems (6:14).

    Elementary but sets the stage.

  3. How The Quantum Annealing Process Works, D-Wave Systems (6:09)

PAR Class 23, Mon 2021-04-19

1 Student project proposals

  1. 1-2 minute presentations.

  2. Upload your proposal to Gradescope, homework 10.

PAR Class 22, Thurs 2021-04-15

Table of contents

1 Quantum computing intro, 2

  1. My quantum computing intro.

    1. try again to watch a few videos.

      1. Visualization of superposition | QuTech Academy (0:44)

      2. David Deutsch - Why is the Quantum so Strange? (8:43)

      3. Quantum Computing for Computer Scientists (1:28:30), the 1st 1/2 hour.

      4. Quantum Algorithms (2:52)

        Which problems can quantum computers solve exponentially faster than classical computers? David Gosset, IBM quantum computing research scientist, explains why algorithms are key to finding out.

      5. Quantum Computing 2020 Update 11:58.

    2. then starting from 4 Algorithm details in my summary.

PAR Class 21, Mon 2021-04-12

1 Final projects

  1. See the syllabus.

  2. April 19: team and title, and 1-2 minute proposal presentation to class.

  3. April 26: 100 word project report.

  4. April 26, 29, or May 3: 10-15 minute presentations. Email me with preferred dates.

  5. May 5: report due.

2 Thrust, concl

  1. Universal_vector uses unified memory. No need for host vs device. E.g.,

    /parclass/2021/files/thrust/rpi/tiled_range2u.cu vs tiled_range2.cu

  2. This might have a 2x performance penalty. Dunno.

  3. You can force a function to execute on the host, or on the device, with an extra 1st argument (see the sketch after this list).

  4. There are compiler switches to define what the host or device should be. Common devices are host single thread, host OpenMP, host TBB, GPU. You could conceivably add your own.

    The theory is that no source code changes are required.
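
Here's a minimal sketch (mine, not course code) of that extra 1st argument, using Thrust's execution policies:

// Forcing where a Thrust call runs with an execution policy argument.
#include <thrust/execution_policy.h>
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <vector>
#include <iostream>

int main() {
  std::vector<int> h(1000, 1);
  thrust::device_vector<int> d(1000, 1);
  int a = thrust::reduce(thrust::host, h.begin(), h.end());    // run on the host
  int b = thrust::reduce(thrust::device, d.begin(), d.end());  // run on the device
  std::cout << a << " " << b << "\n";                          // 1000 1000
}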

3 Several parallel Nvidia tools

  1. OpenMP, OpenACC, Thrust, CUDA, C++17, ...

  2. https://developer.nvidia.com/blog/accelerating-standard-c-with-gpus-using-stdpar/

  3. https://developer.nvidia.com/hpc-compilers

  4. There are other interesting pages.

  5. To the first order, they have similar performance. That's dominated by data transfer, which they do the same.

  6. C++17 may have the greatest potential, but I don't know whether it's mature enough yet.

4 Nvidia GTC

  1. This week, free. https://www.nvidia.com/en-us/gtc/

  2. Jensen Huang's keynote should be good.

  3. Browse around.

5 My parallel research

  1. I use thrust (and OpenMP) for some geometry (CAD or GIS) research, doing efficient algorithms on large datasets.

  2. One example is to process a set of 3D points to find all pairs closer than a given distance.

  3. Another is to overlay two polyhedra or triangulations to find all intersecting pieces. That uses big rational arithmetic to avoid roundoff errors and Simulation of Simplicity to handle degenerate cases.

PAR Class 20, Mon 2021-04-05

1 Thrust, ctd

  1. Examples

    1. bucket_sort2d.cu overlays a grid on a set of 2D points and finds the points in each grid cell (bucket).

      1. The tuple is an efficient class for a short vector of fixed length.

      2. Note how random numbers are generated. You combine an engine that produces random output with a distribution.

        However you might need more complicated coding to make the numbers good when executing in parallel. See monte_carlo_disjoint_sequences.cu.

      3. The problem is that the number of points in each cell is unpredictable.

      4. The cell containing each point is computed, and then the cell numbers and points are sorted together to bring the points in each cell together.

      5. Then lower_bound and upper_bound are used to find each bucket in that sorted vector of points.

      6. See this lower_bound description.

    2. mode.cu shows:

      1. Counting the number of unique keys in a vector.

        1. Sort the vector.

        2. Do an inner_product. However, instead of the operators being times and plus, they are not-equal-to-the-next-element and plus. (This is sketched in code after this list.)

      2. Counting their multiplicity.

        1. Construct vectors, sized at the number of unique keys, to hold the unique keys and counts.

        2. Do a reduce_by_key on a constant_iterator, using the sorted vector as the keys. For each range of identical keys, it sums the constant_iterator. That is, it counts the number of identical keys.

        3. Write a vector of unique keys and a vector of the counts.

      3. Finding the most used key (the mode).

        1. Do max_element on the counts vector.

    3. repeated_range.cu repeats each element of an N-vector K times: repeated_range([0, 1, 2, 3], 2) -> [0, 0, 1, 1, 2, 2, 3, 3]. It's a lite version of expand.cu, but uses a different technique.

      1. Here, N=4 and K=2.

      2. The idea is to construct a new iterator, repeated_range, that, when read and incremented, will return the proper output elements.

      3. The construction stores the relevant info in structure components of the variable.

      4. Treating its value like a subscript in the range [0,N*K), it divides that value by K and returns that element of its input.

      See also strided_range.cu and tiled_range.cu.
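
Here's a minimal sketch (mine, not the distributed mode.cu) of counting unique keys with inner_product, where the operators are not-equal-to-the-next-element and plus:

// Count unique keys in a sorted vector: compare each element with its
// successor, sum the mismatches, then add 1 for the first key.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/inner_product.h>
#include <thrust/functional.h>
#include <iostream>

int main() {
  int init[8] = {3, 1, 4, 1, 5, 9, 2, 6};
  thrust::device_vector<int> v(init, init + 8);
  thrust::sort(v.begin(), v.end());
  int unique = thrust::inner_product(v.begin(), v.end() - 1, v.begin() + 1,
                                     1,                            // count the first key
                                     thrust::plus<int>(),
                                     thrust::not_equal_to<int>());
  std::cout << unique << "\n";                                     // 7 here
}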

1.1 Universal vector; Backends

  1. The Thrust device can be CUDA, OpenMP, TBB, etc.

  2. You can spec it in 2 ways:

    1. by adding an extra arg at the start of a function's arg list.

    2. with an envar

      https://github.com/thrust/thrust/wiki/Host-Backends

      https://github.com/thrust/thrust/wiki/Device-Backends

  3. Universal_vector uses unified memory. No need for host vs device. E.g.,

    /parclass/2021/files/thrust/rpi/tiled_range2u.cu vs tiled_range2.cu

  4. This might have a 2x performance penalty. Dunno.

  5. You can force a function to execute on the host, or on the device, with an extra 1st argument.

  6. There are compiler switches to define what the host or device should be. Common devices are host single thread, host OpenMP, host TBB, GPU. You could conceivably add your own.

    The theory is that no source code changes are required.

2 Linux HMM (Heterogeneous Memory Management)

This is hardware support to make programming devices like GPUs easier. It took years to get into the official kernel, presumably because you don't want bugs in memory management.

This is also a nice intro into current operating systems concerns. Do our operating systems courses talk about this?

  1. https://www.kernel.org/doc/html/v4.18/vm/hmm.html

  2. http://on-demand.gputechconf.com/gtc/2017/presentation/s7764_john-hubbardgpus-using-hmm-blur-the-lines-between-cpu-and-gpu.pdf

3 Several parallel Nvidia tools

  1. OpenMP, OpenACC, Thrust, CUDA, C++17, ...

  2. https://developer.nvidia.com/blog/accelerating-standard-c-with-gpus-using-stdpar/

  3. https://developer.nvidia.com/hpc-compilers

  4. There are other interesting pages.

  5. To the first order, they have similar performance. That's dominated by data transfer, which they do the same.

  6. Parallel STL (Standard Template Lib) incorporated a lot of Thrust.

  7. C++17 may have the greatest potential since it incorporates other ideas.

4 Nvidia GTC

  1. This week, free. https://www.nvidia.com/en-us/gtc/

  2. Jensen Huang's keynote should be good.

  3. Browse around.

5 My parallel research

  1. I use thrust (and OpenMP) for some geometry (CAD or GIS) research, doing efficient algorithms on large datasets.

  2. One example is to process a set of 3D points to find all pairs closer than a given distance.

  3. Another is to overlay two polyhedra or triangulations to find all intersecting pieces. That uses big rational arithmetic to avoid roundoff errors and Simulation of Simplicity to handle degenerate cases.

PAR Class 19, Thurs 2021-04-01

1 Day off

Next Thurs Apr 8 is ECSE Wellness Day. There will be no class.

2 Thrust, ctd

2.1 Examples

The point of these examples is to show you some techniques for parallel programming that are not obvious. I.e., you probably wouldn't think of them if you did not already know them.

  1. I rewrote thrust/examples-1.8/tiled_range.cu into thrust/rpi/tiled_range2.cu .

    It is now much shorter and much clearer. All the work is done here:

    gather(make_transform_iterator(make_counting_iterator(0), _1%N), make_transform_iterator(make_counting_iterator(N*C), _1%N), data.begin(), V.begin());

    1. make_counting_iterator(0) returns pointers to the sequence 0, 1, 2, ...

    2. _1%N is a function computing modulo N.

    3. make_transform_iterator(make_counting_iterator(0), _1%N) returns pointers to the sequence 0%N, 1%N, ...

    4. gather populates V. The i-th element of V gets the element of data indexed by the i-th value of the transform iterator, i.e., the (i%N)-th element of data.

  2. tiled_range3.cu is even shorter. Instead of writing an output vector, it constructs an iterator for a virtual output vector:

    auto output=make_permutation_iterator(data, make_transform_iterator(make_counting_iterator(0), _1%N));

    1. *(output+i) is *(data+(i%N)).

    2. You can get as many tiles as you want by iterating.

    3. tiled_range3.cu also constructs an iterator for a virtual input vector (in this case a vector of squares) instead of storing the data:

    auto data = make_transform_iterator(make_counting_iterator(0), _1*_1);

  3. tiled_range5.cu shows how to use a lambda instead of the _1 notation:

    auto output=make_permutation_iterator(data, make_transform_iterator(make_counting_iterator(0), [](const int i){return i%N;} ));

    1. You have to compile with --std=c++11.

    2. This can be rewritten thus:

      auto f = [](const int i){return i%N;}; auto output = make_permutation_iterator(data, make_transform_iterator(make_counting_iterator(0), f));

    3. The shortest lambda is this:

      auto f = [](){};

  4. repeated_range2.cu is my improvement on repeated_range.cu:

    auto output=make_permutation_iterator(data.begin(), make_transform_iterator(make_counting_iterator(0), _1/3));

    1. make_transform_iterator(make_counting_iterator(0), _1/3) returns pointers to the sequence 0,0,0,1,1,1,2,2,2, ...

  5. Unmodified thrust examples:

    1. expand.cu takes a vector like V= [0, 10, 20, 30, 40] and a vector of repetition counts, like C= [2, 1, 0, 3, 1]. Expand repeats each element of V the appropriate number of times, giving [0, 0, 10, 30, 30, 30, 40]. The process is as follows.

      1. Since the output vector will be longer than the input, the main program computes the output size, by reduce summing C, and constructs a vector to hold the output.

      2. Exclusive_scan C to obtain output offsets for each input element: C2 = [0, 2, 3, 3, 6].

      3. Scatter_if the nonzero counts into their corresponding output positions. A counting iterator, [0, 1, 2, 3, 4] is mapped with C2, using C as the stencil, giving C3 = [0, 0, 1, 3, 0, 0, 4].

      4. An inclusive_scan with max fills in the holes in C3, to give C4 = [0, 0, 1, 3, 3, 3, 4].

      5. Gather uses C4 to gather elements of V: [0, 0, 10, 30, 30, 30, 40].

    2. set_operations.cu. This shows methods of handling an operation whose output is of unpredictable size. The question is, is space or time more important?

      1. If the maximum possible output size is reasonable, then construct an output vector of that size, use it, and then erase it down to its actual size.

      2. Or, run the operation twice. The 1st time, write to a discard_iterator, and remember only the size of the written data. Then, construct an output vector of exactly the right size, and run the operation again. (This is sketched in code at the end of this section.)

        I use this technique a lot with ragged arrays in sequential programs.

    3. sparse_vector.cu represents and sums sparse vectors.

      1. A sparse vector has mostly 0s.

      2. The representation is a vector of element indices and another vector of values.

      3. Adding two sparse vectors goes as follows.

        1. Allocate temporary index and element vectors of the max possible size (the sum of the sizes of the two inputs).

        2. Catenate the input vectors.

        3. Sort by index.

        4. Find the number of unique indices by applying inner_product with addition and not-equal-to-next-element to the indices, then adding one.

          E.g., applied to these indices: [0, 3, 3, 4, 5, 5, 5, 8], it gives 5.

        5. Allocate exactly enough space for the output.

        6. Apply reduce_by_key to the indices and elements to add elements with the same keys.

          The size of the output is the number of unique keys.

  6. What's the best way to sort 16000 sets of 1000 numbers each? E.g., sort the rows of a 16000x1000 array? On geoxeon, /pc/thrust/rpi/comparesorts.cu, which I copied from http://stackoverflow.com/questions/28150098/how-to-use-thrust-to-sort-the-rows-of-a-matrix , compares three methods.

  7. Call the thrust sort 16000 times, once per set. That took 10 secs.

  8. Sort the whole list of 16,000,000 numbers together. Then sort it again by key, with the keys being the set number, to bring the elements of each set together. Since the sort is stable, this maintains the order within each set. (This is also how radix sort works.) That took 0.04 secs.

  9. Call a thrust function (to sort a set) within another thrust function (that applies to each set). This is new in Thrust 1.8. That took 0.3 secs.

This is a surprising and useful paradigm. It works because

  1. There's an overhead to starting each thrust function, and

  2. Radix sort, which thrust uses for ints, takes linear time.

  3. The Thrust examples teach several non-intuitive paradigms. As I figure them out, I'll describe a few. My descriptions are modified and expanded versions of the comments in the programs. This is not a list of all the useful programs, but only of some where I am adding to their comments.

    1. arbitrary_transformation.cu and dot_products_with_zip.cu show the very useful zip_iterator. Using it is a 2-step process.

      1. Combine the separate iterators into a tuple.

      2. Construct a zip iterator from the tuple.

      Note that operator() is now a template.

    2. boundingbox.cu finds the bounding box around a set of 2D points.

      The main idea is to do a reduce. However, the combining operation, instead of addition, is to combine two bounding boxes to find the box around them.

      The combining op can be any associative op.
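
Here's a minimal sketch (mine, not the distributed set_operations.cu) of the run-it-twice technique mentioned above: the 1st pass writes to a discard_iterator just to learn the output size, then the 2nd pass writes into a vector of exactly that size:

// Two-pass sizing of an unpredictable-size output with a discard_iterator.
#include <thrust/device_vector.h>
#include <thrust/set_operations.h>
#include <thrust/iterator/discard_iterator.h>
#include <iostream>

int main() {
  int a[] = {0, 2, 4, 6, 8, 10}, b[] = {0, 3, 6, 9};          // already sorted
  thrust::device_vector<int> A(a, a + 6), B(b, b + 4);
  // Pass 1: write to a discard_iterator; keep only the returned end position.
  auto dbegin = thrust::make_discard_iterator();
  auto dend = thrust::set_intersection(A.begin(), A.end(), B.begin(), B.end(), dbegin);
  int n = dend - dbegin;                                      // 2 elements here
  // Pass 2: allocate exactly n elements and run the operation again for real.
  thrust::device_vector<int> C(n);
  thrust::set_intersection(A.begin(), A.end(), B.begin(), B.end(), C.begin());
  std::cout << n << ": " << C[0] << " " << C[1] << "\n";      // 2: 0 6
}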

PAR Homework 7, due Thu 2021-04-12, noon

Play with Thrust, lambdas, and placeholder notation

Modify some programs in /parclass/2021/files/thrust/nvidia-thrust/examples to use lambdas and placeholder notation. Good choices might be saxpy, repeated_range, bounding_box, or simple_moving_average.

PAR Class 18, Mon 2021-03-29

1 Optional day off

Would the class like a day off in the next week or two?

2 Thrust

  1. Stanford's parallel course notes.

    Starting at Lecture 8, slide 22.

    Lectures 9 and later will be left for you to browse if you're interested.

  2. IMO the syntax for the zip iterator could have been simpler. When I use it, I wrap it in a simpler interface.

  3. Nvidia has (at least) 3 official locations for Thrust, and they don't all have the same version, and don't necessarily have docs and examples.

    1. As part of CUDA, /local/cuda/targets/x86_64-linux/include/thrust/

    2. As part of the HPC SDK.

    3. In the github repository, https://github.com/NVIDIA/thrust

  4. The most comprehensive doc is online at https://docs.nvidia.com/cuda/thrust/index.html

    This is up-to-date, and precise. However, it is only a summary.

  5. There is also http://thrust.github.io/doc/index.html , but it is badly written and slightly obsolete.

  6. There are various easy-to-read tutorials online. However they are mostly obsolescent. E.g., they don't use C++11 lambdas, which are a big help.

  7. Also, none of the non-Nvidia docs mention the recent unified memory additions.

  8. Look at some Thrust programs and documentation in 2021/files/thrust/

  9. There are other alternatives like Kokkos.

  10. The alternatives are lower-level (= faster and harder to use) and newer (= possibly less debugged, fewer users).

  11. For a while it looked like Nvidia had stopped developing Thrust, but they've started up again. Good.

  12. On parallel in 2021/files/thrust/ are many little demo programs from the thrust distribution, with my additions.

  13. Thrust is fast because the functions that look like they would need linear time really take only log time in parallel.

  14. In functions like reduce and transform, you often see an argument like thrust::multiplies<float>(). The syntax is as follows:

    1. thrust::multiplies<float> is a class.

    2. It overloads operator().

    3. However, in the call to reduce, thrust::multiplies<float>() is calling the default constructor to construct a variable of class thrust::multiplies<float>, and passing it to reduce.

    4. reduce will treat its argument as a function name and call it with an argument, triggering operator().

    5. You may also create your own variable of that class, e.g., thrust::multiplies<float> foo. Then you just say foo in the argument list, not foo().

    6. The optimizing compiler will replace the operator() function call with the defining expression and then continue optimizing. So, there is no overhead, unlike if you passed in a pointer to a function.

  15. Sometimes, e.g., in saxpy.cu, you see saxpy_functor(A). (A similar functor is sketched after this list.)

    1. The class saxpy_functor has a constructor taking one argument.

    2. saxpy_functor(A) constructs and returns a variable of class saxpy_functor and stores A in the variable.

    3. The class also overloads operator().

    4. (Let's call the new variable foo.) Calling foo(...) invokes operator() for foo; its execution uses the stored A.

    5. Effectively, we did a closure of saxpy_functor; that is, we bound a property and returned a new, more restricted, variable or class.
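
Here's a minimal sketch (mine, analogous to but not copied from saxpy_functor; the class name and constants are made up) of a closure-style functor passed to thrust::transform:

// A functor that stores bound values and overloads operator().
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <iostream>

struct scale_and_shift {
  float a, b;
  scale_and_shift(float a_, float b_) : a(a_), b(b_) {}              // store the bound values
  __host__ __device__ float operator()(float x) const { return a * x + b; }
};

int main() {
  thrust::device_vector<float> x(4, 1.0f), y(4);
  // scale_and_shift(2,3) constructs a variable; transform then calls its operator().
  thrust::transform(x.begin(), x.end(), y.begin(), scale_and_shift(2.0f, 3.0f));
  std::cout << y[0] << "\n";                                         // 5
}

Since operator() is visible to the compiler, the call is inlined and optimized, as described above.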

2.1 Bug

I found and reported a bug in version 100904. This version does not work with OpenMP. It was immediately closed because they already knew about it. They just hadn't told us users.

A while ago I reported a bug in nvcc, where it went into an infinite loop for a certain array size. The next minor version of CUDA was released the next day.

Two observations:

  1. I'm good at breaking SW by expecting it to meet its specs.

  2. Nvidia is responsive, which I like.

2.3 Alternate docs in parallel-class/2021/files/thrust/doc

  1. An_Introduction_To_Thrust.pdf

  2. GTC_2010_Part_2_Thrust_By_Example.pdf

    We'll look at this starting at slide 27. It shows parallel programming paradigms.

PAR Class 17, Thurs 2021-03-25

1 Student talks

  1. Connor. Cloud-based/docker-based parallel computing.

  2. Isaac. Aerospace.

  3. Jack. Nvidia's autonomous machines.

2 Several forms of C++ functions

  1. Traditional top level function

    auto add(int a, int b) { return a+b;}

    You can pass this to a function. This really passes a pointer to the function. It doesn't optimize across the call.

  2. Overload operator() in a new class

    Each different variable of the class is a different function. The function can use the variable's value. This is a closure.

    This is local to the containing block.

    This form optimizes well.

  3. Lambda, or anon function.

    auto add = [](int a, int b) { return a+b;};

    This is local to the containing block.

    This form optimizes well.

  4. Placeholder notation.

    As an argument in, e.g., transform, you can do this:

    transform(..., _1+_2);

    This is nice and short. (A complete example is sketched after this list.)

    As this is implemented by overloading the operators, the syntax of the expression is limited to what was overloaded.
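
Here's a minimal sketch (mine; the vectors and values are made up) of the placeholder form in a complete Thrust call:

// Placeholder notation: _1 + _2 builds the binary functor for transform.
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>      // thrust::placeholders
#include <iostream>

int main() {
  using namespace thrust::placeholders;
  thrust::device_vector<float> a(4, 2.0f), b(4, 3.0f), c(4);
  thrust::transform(a.begin(), a.end(), b.begin(), c.begin(), _1 + _2);   // c[i] = a[i] + b[i]
  std::cout << c[0] << "\n";                                              // 5
}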

3 Thrust

  1. Thrust is an API that looks like STL. Its backend can be GPU-CUDA, OpenMP, TBB, or sequential host-based code.

  2. Functional-programming philosophy.

  3. Easier programming, once you get used to it.

  4. Code is efficient.

  5. Uses some unusual C++ techniques, like overloading operator().

  6. Since the Stanford slides were created, Thrust has adopted unified addressing, so that pointers know whether they are host or device.

  7. Stanford's parallel course notes.

We'll start lecture 8, which has a lot of content. Today we did up thru slide 24.

PAR Class 16, Mon 2021-03-22

Table of contents

1 Student talks

1.1 Mon

  1. Joseph. OpenCL.

  2. Blaine. Kubernetes.

  3. Dan. Nsight.

  4. Yunshi. Nvidia debugging.

  5. Justin and Mark. HPC++, parallel facilities in current C++.

  6. Ben. docker.

2 Thrust

  1. Stanford's parallel course notes.

We'll finish lecture 7. We've seen, importantly, optimizing techniques and parallel paradigms. Then we'll see Thrust. Lecture 7 presents segmented scan, which is quite useful.

PAR Class 15, Thurs 2021-03-18

1 Student talks

1.1 Mon

  1. Connor. Cloud-based/docker-based parallel computing.

  2. Joseph. OpenCL.

  3. Blaine. Kubernetes.

  4. Dan. Nsight.

  5. Yunshi. Nvidia debugging.

1.2 either

  1. Ben. docker.

1.3 Did not reply

  1. You know who you are.

2 Nvidia GPU and accelerated computing, 9

This is from https://developer.nvidia.com/teaching-kits-downloads and accompanies Programming Massively Parallel Processors: A Hands-on Approach, Third Edition, by David B. Kirk and Wen-mei W. Hwu. I recommend it. (The slides etc. are free but the book isn't.)

My local copy of what I'm using is in /parclass/2021/files/nvidia/GPU-Teaching-Kit-Accelerated-Computing.zip

Today we'll start from Module_20_Related_Programming_Models_OpenCL/

3 OpenCL

  1. Module 20.

  2. Apple's competition to CUDA.

  3. It is largely CUDA, but they changed the names and made it look like OpenGL, and clunkier.

  4. It is not interesting so long as Nvidia is dominant.

4 Nvidia primary documentation

Generally more up-to-date and accurate, but drier. A little disorganized because it keeps growing. The root is here: https://docs.nvidia.com/

Two major relevant sets are:

  1. https://docs.nvidia.com/hpc-sdk/index.html

  2. https://docs.nvidia.com/cuda/index.html

5 Other Nvidia features

We've seen almost everything, except:

  1. Texture and surface maps.

  2. ML HW like A=BC+D for 4x4 matrices.

  3. Ray tracing HW, to compute a ray's intersections with boxes.

  4. Cooperative groups: with Ampere, subsets of a warp can synchronize.

  5. Subsets of a GPU can be defined as virtual GPUS, which are walled off from each other.

  6. Memory can be compressed when stored, making a space-time tradeoff.

  7. The terminology CUDA core is obsolete. Now, they say that an SM has, perhaps, 32 single-float units, 32 integer units, 32 CUDA instruction dispatchers, 16 double-float units, etc. Each unit operates independently.

6 Nvidia conceptual hierarchy

As always, this is as I understand it, and could be wrong. Nvidia uses their own terminology inconsistently. They may use one name for two things (E.g., Tesla and GPU), and may use two names for one thing (e.g., module and accelerator). As time progresses, they change their terminology.

  1. At the bottom is the hardware micro-architecture. This is an API that defines things like the available operations. The last several Nvidia micro-architecture generations are, in order, Tesla (which introduced unified shaders), Fermi, Kepler, Maxwell (introduced in 2014), Pascal (2016), and Volta (2018).

  2. Each micro-architecture is implemented in several different microprocessors. E.g., the Kepler micro-architecture is embodied in the GK107, GK110, etc. Pascal is GP104 etc. The second letter describes the micro-architecture. Different microprocessors with the same micro-architecture may have different amounts of various resources, like the number of processors and clock rate.

  3. To be used, microprocessors are embedded in graphics cards, aka modules or accelerators, which are grouped into series such as GeForce, Quadro, etc. Confusingly, there is a Tesla computing module that may use any of the Tesla, Fermi, or Kepler micro-architectures. Two different modules using the same microprocessor may have different amounts of memory and other resources. These are the components that you buy and insert into a computer. A typical name is GeForce GTX1080.

  4. There are many slightly different accelerators with the same architecture, but different clock speeds and memory, e.g. 1080, 1070, 1060, ...

  5. The same accelerator may be manufactured by different vendors, as well as by Nvidia. These different versions may have slightly different parameters. Nvidia's reference version may be relatively low performance.

  6. The term GPU sometimes refers to the microprocessor and sometimes to the module.

  7. There are at least four families of modules: GeForce for gamers, Quadro for professionals, Tesla for computation, and Tegra for mobility.

  8. Nvidia uses the term Tesla in two unrelated ways. It is an obsolete architecture generation and a module family.

  9. Geoxeon has a (Maxwell) GeForce GTX Titan and a (Kepler) Tesla K20xm. Parallel has a (Volta) RTX 8000 and (Pascal) GeForce GTX 1080. We also have an unused (Kepler) Quadro K5000.

  10. Since the highest-end (Tesla) modules don't have video out, they are also called something like compute modules.

7 GPU range of speeds

Here is an example of the wide range of Nvidia GPU speeds; all times are +-20%.

The Quadro RTX 8000 has 4608 CUDA cores @ 1.77GHz and 48GB of memory. matrixMulCUBLAS runs at 5310 GFlops. The specs claim 16 TFlops. However those numbers understate its capabilities because it also has 576 Tensor cores and 72 ray tracing cores to cast 11G rays/sec.

The GeForce GTX 1080 has 2560 CUDA cores @ 1.73GHz and 8GB of memory. matrixMulCUBLAS runs at 3136 GFlops. However the reported time (0.063 msec) is so small that it may be inaccurate. The quoted speed of the 1080 is about triple that. I'm impressed that the measured performance is so close.

The Quadro K2100M in my Lenovo W540 laptop has 576 CUDA cores @ 0.67 GHz and 2GB of memory. matrixMulCUBLAS runs at 320 GFlops. The time on the GPU was about .7 msec, and on the CPU 600 msec.

It's nice that the performance almost scaled with the number of cores and clock speed.

8 CUDA

8.1 Versions

  1. CUDA has a capability version, whose major number corresponds to the micro-architecture generation. Kepler is 3.x. The K20xm is 3.5. The GTX 1080 is 6.1. The RTX 8000 is 7.5. Here is a table of the properties of different compute capabilities. However, that table is not completely consistent with what deviceQuery shows, e.g., the shared memory size.

  2. nvcc, the CUDA compiler, can be told which capabilities (aka architectures) to compile for. They can be given as a real architecture, e.g., sm_61, or a virtual architecture. e.g., compute_61.

  3. The CUDA driver and runtime also have a software version, defining things like available C++ functions. The latest is 10.1. This is unrelated to the capability version.

8.2 Misc

  1. With CUDA, the dominant problem in program optimization is optimizing the data flow. Getting the data quickly to the cores is harder than processing it. It helps a lot to have regular arrays, where each core reads or writes a successive entry. (See the sketch after this list.)

    This is analogous to the hardware fact that wires are bigger (hence, more expensive) than gates.

  2. That is the opposite of the optimization for OpenMP, where having different threads write to adjacent addresses causes the false sharing problem.

  3. Nvidia CUDA FAQ

    1. has links to other Nvidia docs.

    2. can be a little old.
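
Here's a minimal sketch (mine; the sizes and stride are arbitrary) contrasting coalesced and strided global accesses, which you could time with a profiler such as Nsight Compute or nvprof:

// Adjacent threads reading adjacent words coalesce into few transactions;
// strided access spreads a warp's reads over many transactions.
#include <cstdio>

__global__ void copy_coalesced(const float *in, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];                        // successive threads, successive words
}

__global__ void copy_strided(const float *in, float *out, int n, int stride) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[(i * stride) % n];         // successive threads, far-apart words
}

int main() {
  const int n = 1 << 24;
  float *in, *out;
  cudaMallocManaged(&in, n * sizeof(float));
  cudaMallocManaged(&out, n * sizeof(float));
  for (int i = 0; i < n; i++) in[i] = i;
  copy_coalesced<<<(n + 255) / 256, 256>>>(in, out, n);
  copy_strided<<<(n + 255) / 256, 256>>>(in, out, n, 32);    // time the two kernels separately
  cudaDeviceSynchronize();
  printf("%f\n", out[0]);
  cudaFree(in); cudaFree(out);
}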

9 Types of memory allocation

Here's a brief overview of my understanding of the various places that you can assign memory in a program.

  1. Static. Define a fixed-size global array. The variable is constructed at compile time, so accesses might perhaps be faster. Global vars with non-default initial values increase the executable file size. If they're large enough, you need to use the compiler option -mcmodel=medium or -mcmodel=large. They cause the compiler to generate wider addresses. I don't know the effect on the program's size or speed, but suspect that it's small.

  2. Stack. Define local arrays, which are created and freed as the routine is entered and exited. Their addresses relative to the base of the call frame may be constant. The default stack size is 8MB. You can increase this with the command ulimit, or in the program as shown in stacksize.cc. I believe that in OpenMP, the max stack size may be allocated when each thread is created. Then, a really big stack size might have a penalty.

  3. Heap. You use new and delete. Variables are constructed whenever you want. The more objects on the heap, the more time that each new or delete takes. If you have lots of objects, consider using placement new or creating an array of them.

    For CUDA, some variables must be on the heap.

I like to use static, then stack, and heap only when necessary. However, allocating a few large blocks on the heap is also fast.
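
Here's a minimal sketch (mine) of the three allocation styles:

// 1. static, 2. stack, 3. heap (one large block, not many small ones).
#include <vector>

const int N = 1000000;
float a[N];                      // static: fixed-size global array, constructed at load time

void process() {
  float b[1000];                 // stack: created and freed as the routine is entered and exited
  std::vector<float> c(N);       // heap: one few-but-large allocation
  a[0] = b[0] = c[0] = 1.0f;     // use them so the compiler keeps them
}

int main() { process(); }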

Google's allocator is noticeably better than the default one. To use it, link your programs with -ltcmalloc. You can often use it on an existing executable foo thus:

LD_PRELOAD="/usr/lib/libtcmalloc.so" foo

I found it to save 15% to 30% in time.

Another memory concern is speed. Parallel has a NUMA (Non Uniform Memory Architecture). It has two 14-core Xeons. Each Xeon has 128GB of main memory attached. Although all 256GB are in a common address space, accessing memory attached to the Xeon that the thread is running on is faster.

The following is what I think based on some research, but may be wrong: A 4KB page of memory is assigned to a specific Xeon when it is first written (not when it is reserved). So, each page of a large array may be on a different Xeon. This can be used to optimize things. This gets more fun with 8-processor systems.

All that is separate from cache issues.

You can also assign your OpenMP threads to specific cores. This affects speed in ways I don't understand. The issues are resource sharing vs conflicts.

10 Nvidia GPU summary

Here's a summary of the Nvidia Pascal GP104 GPU architecture as I understand it. It's more compact than I've found elsewhere. I'll add to it from time to time. Some numbers are probably wrong.

  1. The host is the CPU.

  2. The device is the GPU.

  3. The device contains 20 streaming multiprocessors (SMs).

    Different GPU generations have used the terms SMX or SMM.

  4. A thread is a sequential program with private and shared memory, program counter, etc.

  5. Threads are grouped, 32 at a time, into warps.

  6. Warps of threads are grouped into blocks.

    Often the warps are only implicit, and we consider that the threads are grouped directly into blocks.

    That abstraction hides details that may be important; see below.

  7. Blocks of threads are grouped into a grid, which is all the threads in the kernel.

  8. A kernel is a parallel program executing on the device.

    1. The kernel runs potentially thousands of threads.

    2. A kernel can create other kernels and wait for their completion.

    3. There may be a limit, e.g., 5 seconds, on a kernel's run time.

  9. Thread-level resources:

    1. Each thread can use up to 255 fast registers. Registers are private to the thread.

      All the threads in one block have their registers allocated from a fixed pool of 65536 registers. The more registers that each thread uses, the fewer warps in the block can run simultaneously.

    2. Each thread has 512KB slow local memory, allocated from the global memory.

    3. Local memory is used when not enough registers are available, and to store thread-local arrays.

  10. Warp-level resources:

    1. Threads are grouped, 32 at a time, into warps.

    2. Each warp executes as a SIMD, with one instruction register. At each cycle, every thread in a warp is either executing the same instruction, or is disabled. If the 32 threads want to execute 32 different instructions, then they will execute one after the other, sequentially.

      If you read in some NVidia doc that threads in a warp run independently, then continue reading the next page to get the info mentioned in the previous paragraph.

    3. If successive instructions in a warp do not depend on each other, then, if there are enough warp schedulers available, they may be executed in parallel. This is called Instruction Level Parallelism (ILP).

    4. For an array in local memory, which means that each thread will have its private copy, the elements for all the threads in a warp are interleaved to potentially increase the I/O rate.

      Therefore your program should try to have successive threads read successive words of arrays.

    5. A thread can read variables from other threads in the same warp, with the shuffle instruction. Typical operations are to read from the K-th next thread, to do a butterfly permutation, or to do an indexed read. This happens in parallel for the whole warp, and does not use shared memory.

    6. A warp vote combines a bit computed by each thread to report results like all or any.

  11. Block-level resources:

    1. A block may contain up to 1024 threads.

    2. Each block has access to 65536 fast 32-bit registers, for the use of its threads.

    3. Each block can use up to 49152 bytes of the SM's fast shared memory. The block's shared memory is shared by all the threads in the block, but is hidden from other blocks.

      Shared memory is basically a user-controllable cache of some global data. The saving comes from reusing that shared data several times after you loaded it from global memory once.

      Shared memory is interleaved in banks so that some access patterns are faster than others.

    4. Warps in a block run asynchronously and run different instructions. They are scheduled and executed as resources are available.

    5. However they are all running the same instruction sequence, perhaps at different points in it.

    6. That is called SPMD, single program multiple data.

    7. The threads in a block can be synchronized with __syncthreads().

      Because of how warps are scheduled, that can be slow.

    8. The threads in a block can be arranged into a 3D array, up to 1024x1024x64.

      That is for convenience, and does not increase performance (I think).

    9. I'll talk about textures later.

  12. Streaming Multiprocessor (SM) - level resources:

    1. Each SM has 128 single-precision CUDA cores, 64 double-precision units, 32 special function units, and 32 load/store units.

    2. In total, the GPU has 2560 CUDA cores.

    3. A CUDA core is akin to an ALU. The cores, and all the units, are pipelined.

    4. A CUDA core is much less powerful than one core of an Intel Xeon. My guess is 1/20th.

    5. Beware that, in the CUDA C Programming Guide, NVidia sometimes calls an SM a core.

    6. The limited number of, e.g., double-precision units means that a DP instruction will need to be scheduled several times for all the threads to execute it. That's why DP is slower.

    7. Each SM has 4 warp schedulers and 8 instruction dispatch units.

    8. 64 warps can simultaneously reside in an SM.

    9. Therefore up to 32x64=2048 threads can be executed in parallel by an SM.

    10. Up to 16 blocks can simultaneously be resident in an SM.

      However, if each block uses too many resources, like shared memory, then this number is reduced.

      Each block sits on only one SM; no block is split. However a block's warps are executed asynchronously (until synced).

    11. Each SM has 64KiB (?) fast memory to be divided between shared memory and an L1 cache. Typically, 48KiB (96?) is used for the shared memory, to be divided among its resident blocks, but that can be changed.

    12. The 48KB L1 cache can cache local or global memory.

    13. Each SM has a read-only data cache of 48KB to cache the global constant memory.

    14. Each SM has 8 texture units, and many other graphics capabilities.

    15. Each SM has 256KB of L2 cache.

  13. Grid-level resources:

    1. The blocks in a grid can be arranged into a 3D array, up to \((2^{31}-1, 2^{16}-1, 2^{16}-1)\).

    2. Blocks in a grid might run on different SMs.

    3. Blocks in a grid are queued and executed as resources are available, in an unpredictable parallel or serial order. Therefore they should be independent of each other.

    4. The number of instructions in a kernel is limited.

    5. Any thread can stop the kernel by calling assert.

  14. Device-level resources:

    1. There is a large and slow 48GB global memory, which persists from kernel to kernel.

      Transactions to global memory are 128 bytes.

      Host memory can also be memory-mapped into global memory, although the I/O rate will be lower.

      Reading from global memory can take hundreds of cycles. A warp that does this will be paused and another warp started. Such context switching is very efficient. Therefore device throughput stays high, although there is a latency. This is called Thread Level Parallelism (TLP) and is a major reason for GPU performance.

      That assumes that an SM has enough active warps that there is always another warp available for execution. That is a reason for having warps that do not use all the resources (registers etc) that they're allowed to.

    2. There is a 2MB L2 cache, for sharing data between SMs.

    3. There is a small, fast 64KiB global constant memory, which also persists from kernel to kernel. It is implemented as a piece of the global memory, made fast with caches.

      (Again, I'm still resolving this apparent contradiction).

    4. Grid Management Unit (GMU) schedules (pauses, executes, etc) grids on the device. This is more important because grids can start other grids (Dynamic Parallelism).

    5. Hyper-Q: 32 simultaneous CPU tasks can launch kernels into the queue; they don't block each other. If one kernel is waiting, another runs.

    6. CUDA Work Distributor (CWD) dispatches 32 active grids at a time to the SMs. There may be 1000s of grids queued and waiting.

    7. GPU Direct: Other devices can DMA the GPU memory.

    8. The base clock is 1607MHz.

    9. GFLOPS: 8873.

    10. Memory bandwidth: 320GB/s

  15. GPU-level resources:

    1. Being a GeForce product, it has many graphics facilities that we're not using.

    2. There are 4 Graphics processing clusters (GPCs) to do graphics stuff.

    3. Several perspective projections can be computed in parallel, for systems with several displays.

    4. There's HW for texture processing.

  16. Generational changes:

    1. With each new version, Nvidia tweaks the numbers. Some get higher, others get lower.

      1. E.g., Maxwell had little HW for double precision, and so that was slow.

      2. Pascal's clock speed is much higher.

  17. Refs:

    1. The CUDA program deviceDrv.

    2. http://developer.download.nvidia.com/compute/cuda/compute-docs/cuda-performance-report.pdf

    3. http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf

    4. Better Performance at Lower Occupancy, Vasily Volkov, UC Berkeley, 2010.

    5. https://www.pgroup.com/lit/articles/insider/v2n1a5.htm - well written but old.

    (I'll keep adding to this. Suggestions are welcome.)

11 More CUDA

  1. CUDA function qualifiers (these and the variable qualifiers below are illustrated in a sketch after this list):

    1. __global__: a device function called from the host, starting a kernel.

    2. __device__: a device function called from a device function.

    3. __host__: (the default) a host function called from a host function.

  2. CUDA variable qualifiers:

    1. __shared__

    2. __device__: global.

    3. __device__ __managed__: automatically paged between host and device.

    4. __constant__

    5. (nothing): a register if scalar, or local memory if an array or if no more registers are available.

  3. If installing CUDA on your machine, this repository seems best:

    http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64

    That includes the Thrust headers but not example programs.
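
Here's a minimal sketch (mine, not from that repository; the numbers are arbitrary) using several of these qualifiers together: a __global__ kernel calls a __device__ helper and accumulates into a __device__ __managed__ variable:

// __managed__ data is automatically paged between host and device.
#include <cstdio>

__device__ __managed__ float total;                    // visible to both host and device

__device__ float square(float x) { return x * x; }     // callable only from device code

__global__ void sumsq(const float *a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) atomicAdd(&total, square(a[i]));
}

int main() {
  const int n = 1000;
  float *a;
  cudaMallocManaged(&a, n * sizeof(float));            // managed (unified) memory for the array
  for (int i = 0; i < n; i++) a[i] = 1.0f;
  total = 0;
  sumsq<<<(n + 255) / 256, 256>>>(a, n);
  cudaDeviceSynchronize();
  printf("%f\n", total);                               // 1000
  cudaFree(a);
}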

12 Unionfs: Linux trick of the day

  1. aka overlay FS, translucent FS.

  2. If a, b are directories, and m is an empty directory, then

    unionfs -o cow a=RW:b m

    makes m a combination of a and b, with a having higher priority.

  3. Writing a file into m writes it in a.

  4. Changing a file in b writes the new version into a

  5. Deleting a file in b causes a white-out note to be stored in a.

  6. Unmount it thus:

    fusermount -u m

  7. None of this requires superuser.

  8. Application: making a read-only directory into a read-write directory.

  9. Note: IBM had a commercial version of this idea in its CP/CMS OS in the 1960s.

13 Thrust

  1. Stanford's parallel course notes. We did up thru lecture 6 and parts of 7 and 8.

PAR Class 14, Mon 2021-03-15

1 Nvidia GPU and accelerated computing, 8

This is from https://developer.nvidia.com/teaching-kits-downloads and accompanies Programming Massively Parallel Processors: A Hands-on Approach, Third Edition, by David B. Kirk and Wen-mei W. Hwu. I recommend it. (The slides etc. are free but the book isn't.)

My local copy of what I'm using is in /parclass/2021/files/nvidia/GPU-Teaching-Kit-Accelerated-Computing.zip

Today we'll start from Module_16_Case_Study_Electrostatic_Potential_Calculation/Slides/Lecture-16-1-VMD-case-study-Part1.pdf

PAR Class 13, Thurs 2021-03-11

1 Nvidia GPU and accelerated computing, 7

This is from https://developer.nvidia.com/teaching-kits-downloads

My local copy of what I'm using is in /parclass/2021/files/nvidia/GPU-Teaching-Kit-Accelerated-Computing.zip

Today we'll start from GPU-Teaching-Kit-Accelerated-Computing/Module_10_Parallel_Computation_Patterns_Scan/Slides/Lecture-10-2-work-inefficient-scan-kernel.pdf

This introduces some common parallel programming paradigms. They are useful with any parallel computing platform.

We did up thru 15-2.

PAR Class 12, Mon 2021-03-08

1 Types of virtualization

  1. There are many possible levels of virtualization.

  2. At a low level, one might emulate the HW. This is quite flexible but too slow.

  3. At a higher level, a basic OS runs separate virtual machines, each with its own file system.

    1. Harmless machine instructions execute normally.

    2. Powerful ones are trapped and emulated.

    3. This requires a properly designed instruction set.

    4. IBM has been doing this commercially for 40 years, with something originally called CP/CMS.

    5. I think that IBM lucked out with their instruction set design, and didn't plan it.

    6. Well-behaved clients might have problematic code edited before running, to speed the execution.

    7. I think that Vmware does that.

    8. It seems that compute-intensive clients might have almost no overhead.

    9. However, the emulated file system can be pretty bad.

    10. With Vmware, several clients can all be different OSs, and the host can be any compatible OS.

    11. E.g., I've had a linux vmware host simultaneously running both linux and windows clients.

    12. SFAIK, Vmware currently doesn't run on Ubuntu because of recent linux security upgrades requiring that new modules installed in the kernel be signed.

    13. In linux, root no longer has infinite power.

  4. The next level of virtualization has a nontrivial host OS, but separates the clients from each other.

    1. They see a private view of the process space, file system, and other resources.

    2. This is lighter weight, e.g., quicker to start a VM and less overhead.

    3. The host and client must be the same OS.

    4. This might be called paravirtualization.

    5. Linux supports this with things like namespace isolation and control groups (cgroups). Wikipedia et al describe this.

  5. The next level up is the normal linux security.

    1. You can see all the processes and similar resources.

    2. The file system has the usual protections.

    3. This is hard to make secure when doing something complicated.

    4. How do I protect myself from firefox going bad?

    5. It's easy to describe what it should be allowed to do, but almost impossible to implement.

    6. That includes using apparmor etc.

  6. In theory, packaging an app in a virtual machine has fewer dependencies and is more secure.

  7. You can run the vm w/o changes on different hosts.

  8. A Vmware client can run w/o change on both linux and windows hosts.

  9. You can run a client on your own hardware, then spill over to a commercial cloudy platform when necessary.

2 Docker

  1. Docker is a popular lightweight virtualization system, which Nvidia uses to distribute SW.

  2. Docker runs images that define virtual machines.

  3. Docker images share resources with the host, in a controlled manner.

  4. For simple images (nvidia/cuda is not one), starting the image is so cheap that you can do it to run one command, and encapsulate the whole process in a shell function.

  5. Docker is worth learning, apart from its use by Nvidia for parallel computing. You might also look up Kubernetes.

  6. More info:

    1. https://www.docker.com/

    2. https://www.zdnet.com/article/what-is-docker-and-why-is-it-so-darn-popular/

    3. https://opensource.com/resources/what-docker

  7. I installed docker on parallel to run Nvidia images like pgc++. Then I removed it because it wasn't necessary, was complicated, and was insecure.

3 Nvidia GPU and accelerated computing, 6

This is from https://developer.nvidia.com/teaching-kits-downloads

My local copy of what I'm using is in /parclass/2021/files/nvidia/GPU-Teaching-Kit-Accelerated-Computing.zip

Today we'll start from Module_9_Parallel_Computation_Patterns_Reduction/Slides/Lecture-9-1-reduction.pdf

This introduces some common parallel programming paradigms.

Today we did thru 10-2.

PAR Homework 6, due Thu 2021-03-18, noon

Student talks - 2nd set

  1. Rules:

    1. Two students may work together to present one talk.

    2. Prepare and give a 5-10 minute talk on Thurs 3/18, Mon 3/22, or Thurs 3/25 (less preferable). A team would jointly give one 5-10 minute talk. I'll pack six talks per day, filling the earliest days first, and fill out unused time with new material.

    3. Email me with your team, topic, and preferred dates.

  2. Topic:

    1. In class, we're covering several parallel SW tools: OpenMP, OpenACC, CUDA, Thrust.

    2. In your first set of talks, you also mentioned some parallel tools like MPI and pthreads.

    3. Now, you research and present another tool useful in parallel work, either HW or SW. It might be:

      1. something from an energy lab, like Kokkos.

      2. something cloud- or docker-based.

      3. parallel facilities in current C++.

      4. a commercial product.

      5. OpenCL.

      6. a debugging tool, e.g., from Nvidia, which I might have mentioned but not described in detail.

      7. something mentioned in a recent GPU Technology Conference (GTC) or Supercomputing Conference talk.

PAR Class 11, Thu 2021-03-04

1 Nvidia GPU and accelerated computing, 5

This is from https://developer.nvidia.com/teaching-kits-downloads

My local copy of what I'm using is in /parclass/2021/files/nvidia/GPU-Teaching-Kit-Accelerated-Computing.zip

Today we'll start from Module_6_Memory_Access_Performance/Slides/Lecture-6-2-memory-coalescing.pdf

We did through 8-4.

PAR Homework 5, due Thu 2021-03-11, noon

Teams of 2 are ok.

Use the ideas we've seen to implement a parallel text histogram program, both on the multicore CPU and on the manycore GPU. Do it with and without private subtotal arrays.

Use current tools, like managed memory on the device.

Compare the performance processing a big file. Separate the time to read the data into memory from the time to process it.

Try to use debugging tools to find hotspots.

To reduce file reading time, before starting your program, put the data file into the in-core filesystem in /dev/shm/, aka /ds .

PAR Class 10, Mon 2021-03-01

1 Nvidia GPU and accelerated computing, 4

This is from https://developer.nvidia.com/teaching-kits-downloads

My local copy of what I'm using is in /parclass/2021/files/nvidia/GPU-Teaching-Kit-Accelerated-Computing.zip

Today we'll start from Module_2_Introduction_to_CUDA_C/Slides/Lecture-3-5-transparent-scaling.pdf

We did up through 6.1.

PAR Class 9, Thu 2021-02-25

1 Nvidia GPU and accelerated computing, 3

This is from https://developer.nvidia.com/teaching-kits-downloads

My local copy of what I'm using is in /parclass/2021/files/nvidia/GPU-Teaching-Kit-Accelerated-Computing.zip

Today we'll start from Module_2_Introduction_to_CUDA_C/Slides/Lecture-2-3-cuda-parallelism-threads.pdf

PAR Class 8, Mon 2021-02-22

1 Compilers on parallel.ecse

To access them, source /opt/nvidia/initenvs

2 Nvidia GPU and accelerated computing, 2

This is from https://developer.nvidia.com/teaching-kits-downloads

My local copy of what I'm using is in /parclass/2021/files/nvidia/GPU-Teaching-Kit-Accelerated-Computing.zip

Today we'll start from Module_2_Introduction_to_CUDA_C/Slides/Lecture-2-4-cuda-toolkit.pdf .

We'll also look at programs in LabsandSolutions/ . We'll see debugging tools like cuda-gdb and cuda-memcheck.

PAR Class 7, Thu 2021-02-18

2 How I show files from parallel.ecse to the class

Especially graphics stuff like PDFs. The problem is that graphics over ssh is slow.

/parclass is a git repository. There's a copy on my local laptop. I display from that.

3 OpenACC concluded

I recommend OpenACC for Programmers: Concepts and Strategies by Sunita Chandrasekaran and Guido Juckeland. Informit, Amazon, Github.

Chapter 4 is online for free as PDF and html.

We'll try a program. Note how using

export PGI_ACC_TIME=1

causes timing info to be printed.
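As a concrete (if trivial) example of the kind of program we'll try, here is a minimal OpenACC sketch. The file name and compile flags are my assumptions, not the book's example:

// saxpy.cc -- compile with something like: nvc++ -acc -Minfo=accel saxpy.cc
#include <cstdio>

int main() {
  const int n = 1 << 20;
  float *x = new float[n], *y = new float[n];
  for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

  // The directive asks the compiler to offload the loop; it also generates
  // the host-device data movement for x and y.
  #pragma acc parallel loop
  for (int i = 0; i < n; i++)
    y[i] += 3.0f * x[i];

  printf("y[0] = %f\n", y[0]);             // expect 5.0
  delete[] x;
  delete[] y;
}

Running it after the export above should then print the per-region kernel and data-transfer timing breakdown when the program exits.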

4 Nvidia GPU and accelerated computing

This is from https://developer.nvidia.com/teaching-kits-downloads

My local copy of what I'm using is in /parclass/2021/files/nvidia/GPU-Teaching-Kit-Accelerated-Computing.zip

We'll spend several classes on this topic, speed reading through the slides.

Even though these slides are dated 2020, they have some obsolescent ideas, like explicitly copying data between host and device.
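For contrast with the slides' explicit-copy style, here is roughly what a managed-memory version of a trivial kernel looks like (a minimal sketch; error checking omitted):

// inc.cu -- compile with something like: nvcc inc.cu
#include <cstdio>

__global__ void inc(int *a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) a[i] += 1;
}

int main() {
  const int n = 1 << 20;
  int *a;
  cudaMallocManaged(&a, n * sizeof(int));  // one allocation, visible to host and device
  for (int i = 0; i < n; i++) a[i] = i;    // no explicit cudaMemcpy needed

  inc<<<(n + 255) / 256, 256>>>(a, n);
  cudaDeviceSynchronize();                 // wait for the kernel before the host reads

  printf("a[10] = %d (expect 11)\n", a[10]);
  cudaFree(a);
}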

Also, some terminology is not standardized and may be inconsistent, even in Nvidia docs.

Today we did up through Module_2_Introduction_to_CUDA_C/Slides/Lecture-2-3-cuda-parallelism-threads.pdf .

5 Archivemount

This is today's programmer productivity tip, unrelated to parallel computing but generally useful.

Rather than explicitly extracting large zip archives or tarballs in order to read files in them, I use archivemount to create a virtual file system. This saves disk space, and it doesn't stress git as much (fewer files). When reading, the I/O time is only insignificantly increased. For some formats, you can even write. You can also have more confidence that a single zip file hasn't been changed than you can in a directory with perhaps hundreds of files.

PAR Homework 4, due Thu 2021-02-25, noon

Rules

  1. Submit the answers to Gradescope.

  2. You may do homeworks in teams of 2 students. Create a gradescope team and make one submission with both names.

Question

Download, compile, and run the sample programs in Chapter 4 of OpenACC for Programmers: Concepts and Strategies. Report your experience comparing the serial, OpenMP, and OpenACC versions.

PAR Class 6, Thu 2021-02-11

1 Compilers ctd

  1. I installed Nvidia's compiler suite on parallel into /opt/nvidia

  2. Make them available to you by sourcing /opt/nvidia/initenvs

  3. These are the recommended compilers.

  4. They work better with Nvidia GPUs.

  5. However g++ produced faster OpenMP code for the Xeon on one example.

2 OpenACC ctd

  1. https://www.openacc.org/events/openacc-online-course-2018 weeks 2 and 3.

They also introduce you to Nvidia.

3 Parallel.ecse available

You may use spare time on parallel.ecse for any legal, ethical purpose, even unrelated to this course. No running businesses or trying to make money. No mining. However, research and having fun are fine.

I've been running boinc for https://milkyway.cs.rpi.edu/milkyway/ since 2018, and am ranked 359 among users in total credit.

4 New teaching tool

Today's new gimmick is mirroring my iPad to a window on my laptop, so I can use the iPad as a graphics tablet. Relevant SW: uxplay on the laptop (see https://rodrigoribeiro.site/2020/08/09/mirroring-ipad-iphone-screen-on-linux/ ) and any notetaking tool on the iPad.

Earlier I tried using a laptop with a touch screen. That didn't work well, apparently because of deficiencies in the Linux SW controlling the screen.

5 Applied EE

My home Tesla Powerwalls (27 kWh capacity) and solar panels (8 kW peak) are finally working. Over the year, my net electrical consumption will be close to zero. I can survive perhaps a 2-day blackout. However, the real reason for getting them is that I like gadgets.

6 Changes in this course from 2020 to 2021

  1. Drop the attempt to use docker for compilers. It's complicated, unnecessary, and a security risk.

    However, docker is an important industrial tool. If anyone would like some class time on it, just ask.

  2. Use nvc++, not pgc++. nvc++ didn't exist last year.

7 Nvidia GPU and accelerated computing

This is from https://developer.nvidia.com/teaching-kits-downloads

My local copy of what I'm using is in /parclass/2021/files/nvidia/GPU-Teaching-Kit-Accelerated-Computing.zip

PAR Homework 3, due Thu 2021-02-15, noon

Rules

  1. Submit the answers to Gradescope.

  2. You may do homeworks in teams of 2 students. Create a gradescope team and make one submission with both names.

Question

  1. The goal is to measure whether OpenMP actually makes matrix multiplication faster, with and w/o SIMD.

  2. You may use anything in /parallel-class/openmp that seems useful.

  3. Write a C++ program on parallel.ecse to initialize pseudorandomly and multiply two 100x100 float matrices. One possible initialization:

    a[i][j] = i*3 + (j*j)%5; b[i][j] = i*2 + (j*j)%7;

  4. (10 points) Report the elapsed time. Include the program listing.

  5. Add an OpenMP pragma to do the work in parallel; see the sketch after this homework for the general mechanism.

  6. (10 points) Report the elapsed time, varying the number of threads thus: 1, 2, 4, 8, 16, 32, 64.

    What do you conclude?

  7. (5 points) Repeat that two more times to see how consistent the times are.

  8. (10 points) Modify the pragma to use SIMD.

    Report the elapsed time, varying the number of threads thus: 1, 2, 4, 8, 16, 32, 64.

    What do you conclude?

  9. (10 points) Compile and run your program with two different levels of compiler optimization: -O1 and -O3, reporting the elapsed time. Modify your program to prevent the optimizer from optimizing the program away to nothing. E.g., print a few values.

  10. (5 points) What do you conclude about everything?

Total: 40 pts.
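As mentioned in question 5, here is a minimal sketch of the pragma mechanism. It is only an illustration under my own choices (loop structure, timing with omp_get_wtime); the measurements, the SIMD variant, and the analysis are the actual assignment.

#include <cstdio>
#include <omp.h>

const int N = 100;
float a[N][N], b[N][N], c[N][N];

int main() {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      a[i][j] = i*3 + (j*j)%5;
      b[i][j] = i*2 + (j*j)%7;
    }

  double t0 = omp_get_wtime();
  #pragma omp parallel for          // experiment: add simd or collapse clauses, vary OMP_NUM_THREADS
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      float sum = 0;
      for (int k = 0; k < N; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }

  // Print something so the optimizer can't delete the computation.
  printf("c[1][2]=%f, elapsed=%f s\n", c[1][2], omp_get_wtime() - t0);
}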

PAR Class 5, Mon 2021-02-08

1 ssh, afs, zfs

  1. I recommend that you create key pairs to connect to parallel w/o typing your password each time.

  2. To avoid retyping your ssh private key passphrase in the same shell, do this

    ssh-add

  3. One advantage of ssh is that you can mount your parallel directory on your local machine. In the files browser on my laptop, I connect to the server sftp://parallel.ecse.rpi.edu . Any other program to mount network shares would work as well.

    Where the share is actually mounted varies. On my laptop, it's here: /var/run/user/1000/gvfs/sftp:host=parallel.ecse.rpi.edu,user=wrf

    1000 is my uid on my laptop. Yours is not 1000.

    As with any network mount, fancy filesystem things like attributes, simultaneous reading and writing from different machines, etc., probably won't work.

  4. With ssh, you can also copy files back and forth w/o mounting or typing passwords:

    1. scp localfile parallel.ecse.rpi.edu:

    2. scp -r localdir parallel.ecse.rpi.edu:

    3. scp parallel.ecse.rpi.edu:remotefile .

    It even does filename completion on remote files.

  5. You can also run single commands:

    ssh parallel.ecse.rpi.edu hostname

  6. On parallel, most filesystems use zfs.

    1. Files are transparently compressed, so there's no need to explicitly use gzip etc.

    2. Filesystems are automatically snapshotted, so you can often recover accidentally deleted or changed files. Look under FSROOT/.zfs/snapshot/ .

2 Stack size

  1. Each thread has its own stack.

  2. Default size is small, but easy and cheap to increase.

  3. Since pages (for stacks and everything) are not allocated until touched, reserving large blocks of VM, like for stacks, is free.

  4. stacksize.cc - get and set stack size. Can also do ulimit -s.
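Here is a minimal sketch of the underlying calls, as a reconstruction of what such a program might do (my guess, not the actual stacksize.cc):

#include <cstdio>
#include <sys/resource.h>

int main() {
  rlimit r;
  getrlimit(RLIMIT_STACK, &r);                    // query the current limits
  printf("stack soft limit: %ld bytes\n", (long)r.rlim_cur);

  r.rlim_cur = 64L * 1024 * 1024;                 // set the soft limit to 64 MB (must stay <= the hard limit)
  if (setrlimit(RLIMIT_STACK, &r) != 0)
    perror("setrlimit");

  getrlimit(RLIMIT_STACK, &r);
  printf("new soft limit:   %ld bytes\n", (long)r.rlim_cur);
}

For OpenMP worker threads, the OMP_STACKSIZE environment variable plays a similar role.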

3 OpenMP

  1. Versions: OpenMP is a living project that is updated every few years.

    1. There is a lot of free online documentation, examples, and tutorials.

    2. New features are regularly added, such as support for GPU backends.

    3. However, the documentation and implementations lag.

    4. In my course, I'll use obsolete documentation when it's good enough. It's all still correct and fine for most applications.

    5. Then I'll present some recent additions.

  2. We'll review some examples running on parallel, at /parclass2022/openmp/rpi/ .

  3. Now that we've seen some examples, look at the LLNL tutorial. https://computing.llnl.gov/tutorials/openMP/

  4. We saw that running even a short OpenMP program could be unpredictable. The reason for that was a big lesson.

    There are no guarantees about the scheduling of the various threads. There are no guarantees about fairness of resource allocation between the threads. Perhaps thread 0 finishes before thread 1 starts. Perhaps thread 1 finishes before thread 2 starts. Perhaps thread 0 executes 3 machine instructions, then thread 1 executes 1 instruction, then thread 0 executes 100 more, etc. Each time the process runs, something different may happen. One thing may happen almost all the time, to trick you.

    This happened during the first space shuttle launch in 1981. A 1-in-70 chance event prevented the computers from synchronizing. This had been observed once before during testing. However, when they couldn't get it to happen again, they ignored it.

    This interlacing of different threads can happen at the machine instruction level. The C statement K=K+3 can translate into the machine instructions

    LOAD K; ADD 3; STORE K.

    Label the two threads A and B (in class I color them red and blue).

    If the threads interlace thus:

    A: LOAD K; A: ADD 3; A: STORE K; B: LOAD K; B: ADD 3; B: STORE K

    then K increases by 6. If they interlace thus:

    A: LOAD K; A: ADD 3; B: LOAD K; B: ADD 3; A: STORE K; B: STORE K

    then K increases by only 3.

  5. This can be really hard to debug, particularly when you don't know where this is happening.

    One solution is to serialize the offending code, so that only one thread executes it at a time. The limit would be to serialize the whole program and not to use parallel processing.

  6. OpenMP has several mechanisms to help.

  7. A critical pragma serializes the following block. There are two considerations.

    1. You lose the benefits of parallelization on that block.

    2. There is an overhead to set up a critical block that I estimate might be 100,000 instructions.

  8. An atomic pragma serializes one short C statement, which must be one of a small set of allowed operations, such as increment. It matches certain atomic machine instructions, like test-and-set. The overhead is much smaller.

  9. If every thread is summing into a common total variable, the reduction pragma causes each thread to sum into a private subtotal, and then sums the subtotals at the end. This is very fast. (A sketch contrasting the race, atomic, and reduction appears after this list.)

  10. Another lesson is that sometimes you can check your program's correctness with an independent computation. For the trivial example of summing \(i^2\), use the formula

    \(\sum_{i=1}^N i^2 = \frac{N(N+1)(2N+1)}{6}\).

    There is a lot of useful algebra if you have the time to learn it. I, ahem, learned this formula in high school.

  11. Another lesson is that even when the program gives the same answer every time, it might still be consistently wrong.

  12. Another is that just including OpenMP facilities, like -fopenmp, into your program slows it down even if you don't use them.

  13. Another is that the only meaningful time metric is elapsed real time. One reason that CPU time is meaningless is that OpenMP sometimes pauses a thread's progress by making it spin in a CPU-bound loop. (That is also a common technique in HW.)

  14. Also note that CPU times can vary considerably with successive executions.

  15. Also, using too many threads might increase the real time.

  16. Finally, floating point numbers have their own problems. They are an approximation of the mathematical concept called the real number field. That is defined by 11 axioms that state obvious properties, like

    A+B=B+A (commutativity) and

    A+(B+C)=(A+B)+C (associativity).

    This is covered in courses on modern algebra or abstract algebra.

    The problem is that most of these are violated, at least somewhat, with floats. E.g.,

    \(\left(10^{20}-10^{20}\right)+1=0+1=1\) but

    \(10^{20}+\left(-10^{20}+1\right)=10^{20}-10^{20}=0\).

    Therefore when threads execute in a different order, floating results might be different.

    There is no perfect solution, though using double is a start.

    1. On modern CPUs, double is just as fast as float.

    2. However it's slower on GPUs.

      How much slower can vary. Nvidia's Maxwell line spent very little real estate on double precision, so it was very slow. In contrast, Maxwell added a half-precision float, for implementing neural nets. Apparently neural nets use very few significant digits.

      Nvidia's newer Pascal line reverted this design choice and spends more real estate on implementing double. parallel.ecse's GPU is Pascal.

      Nvidia's Volta generation also does doubles well.

    3. It also requires moving twice the data, and data movement is often more important than CPU time.

    The large field of numerical analysis is devoted to finding solutions, with more robust and stable algorithms.

    Summing an array by first sorting it, and then adding the elements smallest in absolute value first, is one technique. Inverting a matrix by pivoting on the element largest in absolute value, instead of on \(a_{11}\), is another.

  17. Another nice, albeit obsolescent and incomplete, ref: Wikibooks OpenMP.

  18. Its tasks examples are interesting.

  19. OpenMP tasks

    1. While inside a pragma parallel, you queue up lots of tasks - they're executed as threads are available.

    2. My example is tasks.cc with some variants.

    3. taskfib.cc is modified from an example in the OpenMP spec.

      1. It shows how tasks can recursively spawn more tasks.

      2. This is only an example; you would never implement Fibonacci that way. (The fastest way is the following closed form. In spite of the sqrts, the expression always gives an integer.)

        \(F_n = \frac{\left( \frac{1+\sqrt{5}}{2} \right) ^n - \left( \frac{1-\sqrt{5}}{2} \right) ^n }{\sqrt{5}}\)

      3. Spawning a task is expensive; we can calculate the cost.

  20. omps.cc - read assorted OMP variables.

  21. Since I/O is not thread-safe, in hello_single_barrier.cc it has to be protected by a critical and also followed by a barrier.

  22. OpenMP sections - my example is sections.cc with some variants.

    1. Note my cool way to print an expression's name followed by its value.

    2. Note the 3 required levels of pragmas: parallel, sections, section.

    3. The assignment of sections to threads is nondeterministic.

    4. IMNSHO, OpenMP is considerably easier than pthreads, fork, etc.

  23. http://stackoverflow.com/questions/13788638/difference-between-section-and-task-openmp

  24. OpenMP is weak on compiling for GPUs. So....
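To tie several of the above points together (the race, the atomic fix, the reduction fix, the \(\sum i^2\) check, and elapsed real time), here is a small sketch; the loop bound and variable names are arbitrary choices of mine:

#include <cstdio>
#include <omp.h>

int main() {
  const long n = 1000000;
  long racy = 0, atomic_sum = 0, red_sum = 0;
  double t0 = omp_get_wtime();

  #pragma omp parallel for
  for (long i = 1; i <= n; i++) {
    racy += i * i;                         // data race: often wrong, sometimes "works"
  }

  #pragma omp parallel for
  for (long i = 1; i <= n; i++) {
    #pragma omp atomic                     // correct, but every update is serialized
    atomic_sum += i * i;
  }

  #pragma omp parallel for reduction(+ : red_sum)
  for (long i = 1; i <= n; i++) {
    red_sum += i * i;                      // per-thread subtotals, combined at the end
  }

  long exact = n * (n + 1) * (2 * n + 1) / 6;      // the independent algebraic check
  printf("racy=%ld atomic=%ld reduction=%ld exact=%ld\n",
         racy, atomic_sum, red_sum, exact);
  printf("elapsed real time: %f s\n", omp_get_wtime() - t0);
}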

4 OpenACC

  1. Newer than OpenMP.

  2. More abstract; compiler tries to determine what to parallelize.

  3. Your algorithm needs to be parallelizable. That's the hardest part.

  4. https://www.openacc.org/

  5. I'll walk through the tutorials from https://www.openacc.org/events/openacc-online-course-2018 .

    They also introduce you to Nvidia.

    Today: week 1.

5 Lots of compilers

  1. g++

  2. nvcc - specifically for cuda.

  3. pgc++ - commercial. Perhaps better than g++, especially for Nvidia.

  4. nvc++ - Nvidia's successor to pgc++; perhaps the best. I'll install it RSN.

6 Homework 3

is online here, due in a week.

7 Next topic

Nvidia

PAR Class 4, Thurs 2021-02-04

1 Thursday special lecture: Quantum Computing with Atoms

NY CREATES Emerging Technologies Seminar Series

February 4th (Thursday), 2021: 11:30 am – 12:30 pm (EST)

Zoom Pre-Registration (by Feb 3) required at:

https://ny-creates.org/news-events/ny-creates-events/

“Quantum Computing with Atoms”

by: Dr. Christopher Monroe

2 Our various computer systems don't talk to each other

Everyone in this class should--

  1. be in gradescope,

  2. have received an invitation to webex meet for the lectures,

  3. be reading the messages in webex (formerly called webex teams),

  4. be in LMS,

  5. be reading this blog.

If you're not in gradescope or webex or webex meet, e.g. because you added late, tell us.

3 Today's student talks

  1. Ben: computers 5&6.

  2. Connor: machine learning

PAR Class 3, Mon 2021-02-01

1 Today's student talks

  1. Yunshi: python

  2. Justin: Application of Par Comp in Astronomy

  3. Dan: physics application

  4. Joseph: pthreads

  5. Jack: Folding at Home

  6. Isaac: MPI

  7. Blaine: Matlab

  8. Mark: top 2 supercomputers

2 Thursday special lecture: Quantum Computing with Atoms

NY CREATES Emerging Technologies Seminar Series

February 4th (Thursday), 2021: 11:30 am – 12:30 pm (EST)

Zoom Pre-Registration (by Feb 3) required at:

https://ny-creates.org/news-events/ny-creates-events/

“Quantum Computing with Atoms”

by: Dr. Christopher Monroe

Co-Founder & Chief Scientist, IonQ Inc. and Professor, Duke University Abstract: Trapped atomic ions are the unique quantum computing physical platform that features qubits with essentially infinite idle coherence times. Such atomic clock qubits are controlled with laser beams, allowing densely-connected and reconfigurable universal gate sets. Unlike all other physical platforms for quantum computing, the path to scale involves concrete architectural paths, from shuttling ions between QPU cores to modular photonic interconnects between multiple QPUs. Full-stack ion trap quantum computers have thus moved away from the physics of qubits and gates and toward the engineering of optical control signals, quantum gate compilation for algorithms, and high level system design considerations. I will summarize the state-of-the-art in these quantum computers and speculate on how they might be used.

Biography: Christopher Monroe is Professor of ECE and Physics at Duke University and the co-Founder and Chief Scientist of IonQ, Inc. Monroe is an atomic physicist and quantum engineer, specializing in the isolation of individual atoms for applications in quantum information science. At NIST in the 1990s, Monroe co-led the team that demonstrated the first quantum logic gate. At the University of Michigan and the University of Maryland, Monroe’s research group pioneered all aspects of trapped atomic ion based quantum computers, making the first steps toward a scalable, reconfigurable, and modular quantum computer system. In 2016, he co-founded IonQ, a startup company leading the way in the fabrication of full-stack quantum computers. Monroe is a member of the National Academy of Sciences and is one of the key architects of the U.S. National Quantum Initiative passed by Congress in 2018.

Although this is outside the class time, I hope you can watch it. After it's over, I'll start a webex meet to discuss it. That'll be the Thurs class.

PAR Homework 2, Mon 2021-02-01

Rules

  1. Due next Mon 2021-02-08, noon.

  2. Submit the answers to Gradescope.

  3. You may do homeworks in teams of 2 students. Create a gradescope team and make one submission with both names.

  4. Short but informative answers are fine.

Questions

Each is 5 points.

  1. Why is one CUDA core less powerful than one Intel Xeon core?

  2. If 10% of a program cannot be parallelized, then what is the max speedup even with infinite parallelism?

  3. "For short running parallel programs, there can actually be a decrease in performance compared to a similar serial implementation.". Why?

  4. What is the difference between strong scaling and weak scaling?

  5. What is cache coherency?

  6. Define "embarrassingly parallel".

  7. How many CUDA cores does the RTX 8000 have?

  8. Why have machine cycle speeds stopped increasing?

  9. Give an advantage and a disadvantage of shared memory versus distributed memory for parallel computation.

  10. In OpenMP, what's the difference between ATOMIC and CRITICAL?

Total: 50 pts.

PAR Class 2, Thu 2021-01-28

1 Webex, classes, etc

  1. I'll continue to use Webex Meet for awhile because it lets me show the chat window simultaneously.

  2. In Webex (formerly called Webex Teams), I have a space for people to talk.

  3. Mediasite has recordings of classes.

  4. My blog has transcripts and chats. See the Files tab.

2 LLNL parallel tutorial

We'll quickly finish it, starting at Parallel Programming Models. https://computing.llnl.gov/tutorials/parallel_comp/

3 Material

  1. Think about these questions.

    1. Why have machine cycle speeds stopped increasing?

    2. What architectures do the top 3 machines on the Top 500 list use?

    3. Which one of the following 4 choices describes most GPUs: SISD, SIMD, MISD, or MIMD?

    4. Which one of the following 4 choices describes most current multicore CPUs: SISD, SIMD, MISD, or MIMD?

    5. Per Amdahl's law, if a program is 10% sequential and 90% parallelizable, what is the max speed up that can be obtained with an infinite number of parallel processors?
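For reference when thinking about the last question, Amdahl's law with serial fraction \(s\) and \(N\) processors is

\(\text{speedup}(N) = \frac{1}{s + (1-s)/N}\), which approaches \(1/s\) as \(N \to \infty\).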

4 Recently obsoleted parallel tech

  1. Intel Xeon Phi

  2. IBM BlueGene

5 My research

I do parallel geometry algorithms on large problems for CAD and GIS. See my home page. If this is interesting, talk to me. Maybe you can do something that leads to a jointly-authored paper.

6 parallel.ecse

  1. I've set up accounts for you.

  2. Your user name is your rcsid.

  3. I'll mention your initial password. Change it.

  4. Play with them. I recommend connecting with ssh.

  5. From outside RPI, you need to use the vpn.

  6. Its accounts are completely separate from RCS; I will use the same userid for convenience. Changing your password on parallel does not affect your RCS password, and vice versa.

7 OpenMP

  1. https://www.openmp.org/ Browse it.

  2. Read https://computing.llnl.gov/tutorials/openMP/

  3. We'll see some examples running on parallel, at /parclass2021/files/openmp/rpi

  4. We saw that running even a short OpenMP program could be unpredictable. The reason for that was a big lesson.

    There are no guarantees about the scheduling of the various threads. There are no guarantees about fairness of resource allocation between the threads. Perhaps thread 0 finishes before thread 1 starts. Perhaps thread 1 finishes before thread 2 starts. Perhaps thread 0 executes 3 machine instructions, then thread 1 executes 1 instruction, then thread 0 executes 100 more, etc. Each time the process runs, something different may happen. One thing may happen almost all the time, to trick you.

PAR Homework 1, Mon 2021-01-25

Student talks - 1st set

  1. (10 points) Pick two topics from this list.

    1. Describe the fastest 2 computers on the top500 list https://www.top500.org/lists/top500/2020/11/

    2. Describe computers 3 and 4 on top500.

    3. Describe computers 5 and 6 on top500.

    4. Describe the Blue Gene.

    5. Summarize python's parallel capabilities.

    6. Summarize Matlab's parallel capabilities.

    7. Summarize Mathematica's parallel capabilities.

    8. Summarize MPI.

    9. Describe an application where physics needs/uses parallel computing.

    10. Describe an application where astronomy needs/uses parallel computing.

    11. Describe an application where biology needs/uses parallel computing.

    12. Give an application where machine learning needs/uses parallel computing.

    13. Give an application where computational fluid dynamics needs/uses parallel computing.

    14. Summarize OpenACC.

    15. Summarize pthreads.

    16. Other related topic of your choice.

  2. Email me with your choices.

  3. I'll assign you your first topic that wasn't already taken.

  4. Two students may work together to present one talk.

  5. Prepare and give 5-10 minute talk next Mon 2/1. We'll spread the talks out evenly. A team would jointly give one 5-10 minute talk.

    I won't block your mike at 10 minutes, but will leave it to you to be responsible. There are 9 students in class. The class is 80 minutes long.

PAR Class 1, Mon 2021-01-25

Table of contents

1 You

After we discuss the syllabus, tell the class about yourself for a minute or two. What would you like to see in this course?

2 Material

  1. Read the syllabus.

  2. Read https://computing.llnl.gov/tutorials/parallel_comp/ for an intro to parallel computing.

  3. Some points:

    1. Parallel computing is decades old; there were commercial machines in the 1980s. I directed two PhD theses in parallel geometry then. However, at that time clock speeds were still increasing, so serial machines were more interesting.

    2. Now: physics limits to processor speed.

    3. History of Nvidia.

      1. Curtis Priem had designed graphics HW for both IBM and Sun Microsystems. (For a while, Sun was THE Unix workstation company. They used open standards and had the best price/performance.)

      2. Nvidia designed gaming graphics accelerators...

      3. that just happened to be parallel coprocessors...

      4. that started to be used for nongraphics parallel processing because of their value.

      5. Nvidia noticed that and added more capability, e.g., double precision IEEE floats, to serve that market.

      6. Currently, some of the highest performance Nvidia boards cannot even do graphics because they don't have video out ports.

    4. Intel CPUs vs Nvidia CUDA cores.

    5. Advantages and disadvantages of shared memory.

    6. OpenMP vs CUDA.

    7. Rate-limiting cost usually I/O not computation.

3 Homework 1

is online, due next Mon.

PAR Syllabus

1 Catalog info

Title

ECSE-4740-01 Applied Parallel Computing for Engineers

Semesters

Spring term annually

Credits

3 credit hours

Time and place

Mon and Thurs 12:20-1:40pm, online

2 Description

  1. This is intended to be a computer engineering course to provide students with knowledge and hands-on experience in developing applications software for affordable parallel processors. This course will mostly cover hardware that any lab can afford to purchase. It will cover the software that, in the prof's opinion, is the most useful. There will also be some theory.

  2. The target audiences are ECSE seniors and grads and others with comparable background who wish to develop parallel software.

  3. This course will have minimal overlap with parallel courses in Computer Science. We will not teach the IBM BlueGene, because it is so expensive, nor cloud computing and MPI, because most big data problems are in fact small enough to fit on our hardware.

  4. You may usefully take all the parallel courses at RPI.

  5. The unique features of this course are as follows:

    1. For 2/3 of the course, use of only affordable hardware that any lab might purchase, such as Nvidia GPUs. This is currently the most widely used and least expensive parallel platform.

    2. Emphasis on learning several programming packages, at the expense of theory. However you will learn a lot about parallel architecture.

  6. Hardware taught, with reasons:

    Multicore Intel Xeon

    universally available and inexpensive, comparatively easy to program, powerful

    Nvidia GPU accelerator

    widely available (Nvidia external graphics processors are on 1/3 of all PCs), very inexpensive, powerful, but harder to program. Good cards cost only a few hundred dollars.

    Quantum computers

    It's new and hot and might some day be useful.

  7. Software that might be taught, with reasons:

    OpenACC C++ extension

    widely used, easy to use if your algorithm is parallelizable, backend is primarily multicore Xeon but also GPUs.

    Thrust C++ functional programming library

    FP is nice, hides low level details, backend can be any major parallel platform.

    MATLAB, Mathematica

    easy to use parallelism for operations that they have implemented in parallel, etc.

    CUDA C++ extension and library for Nvidia

    low level access to Nvidia GPUs.

    nvc++

    Nvidia's parallel C++ compiler.

  8. The techniques learned here will also be applicable to larger parallel machines -- numbers 2 and 3 on the top 500 list use NVIDIA GPUs.

  9. Effectively programming these processors will require in-depth knowledge about parallel programming principles, as well as the parallelism models, communication models, and resource limitations of these processors.

3 Prerequisite

ECSE-2660 CANOS or equivalent, and knowledge of C++.

4 Professor


W. Randolph Franklin. BSc (Toronto), AM, PhD (Harvard)

Office

online only

Phone

+1 (518) 276-6077 (forwards)

Email

frankwr@YOUKNOWTHEDOMAIN

Email is my preferred communication medium.

Sending from non-RPI accounts is fine, but please show your name, at least in the comment field. A subject prefix of #Prob is helpful. GPG encryption is fine.

Web

https://wrf.ecse.rpi.edu/

A quick way to get there is to google RPIWRF.

Office hours

After each online lecture, usually as long as anyone wants to talk. Also by appointment.

5 Course websites

The homepage has lecture summaries, syllabus, homeworks, etc.

6 Reading material

6.1 Text

There is no required text, but the following inexpensive books may be used. I might mention others later.

  1. Sanders and Kandrot, CUDA by example. It gets excellent reviews, although it is several years old. Amazon has many options, including Kindle and renting hardcopies.

  2. Kirk and Hwu, 2nd edition, Programming massively parallel processors. It concentrates on CUDA.

One problem is that even recent books may be obsolete. For instance they may ignore the recent CUDA unified memory model, which simplifies CUDA programming at a performance cost. Even if the current edition of a book was published after unified memory was released, the author might not have updated the examples.

6.2 Web

There is a lot of free material on the web, which I'll reference class by class. Because web pages vanish so often (really!), I may cache some locally. If interested, you might start here:

https://hpc.llnl.gov/training/tutorials

7 Computer systems used

7.1 parallel.ecse

This course will primarily use (remotely via ssh) parallel.ecse.rpi.edu.

Parallel has:

  1. dual 14-core Intel Xeon E5-2660 2.0GHz

  2. 256GB of DDR4-2133 ECC Reg memory

  3. Nvidia GPUs:

    1. Quadro RTX 8000 with 48GB memory, 16 TFLOPS, and 4608 CUDA cores. This can do real time ray tracing.

    2. GeForce GTX 1080 with 8GB memory and 2560 CUDA cores.

  4. Samsung Pro 850 1TB SSD

  5. WD Red 6TB 6Gb/s hard drive

  6. CUDA 10.1

  7. OpenMP 4.5

  8. Thrust

  9. Ubuntu 20.10

Material for the class is stored in /parallel-class/ .

7.2 Amazon EC2

We might also use parallel virtual machines on the Amazon EC2. If so, you will be expected to establish an account. I expect the usage to be in the free category.

7.3 Gradescope

Gradescope will be used for you to submit homeworks and for us to distribute grades.

7.4 WebEx

to deliver lectures, to discuss, etc.

7.5 Others

Delivering good remote courses is a work in progress. If I learn of a new tool that looks useful, I might use it.

8 Assessment measures, i.e., grades

  1. There will be no exams.

  2. Each student will make several in-class presentations summarizing some relevant topic.

  3. There will be a term project.

8.1 Homeworks

There will be some homeworks.

You may do homeworks in teams of up to 3 students. Create a gradescope team and make one submission with all the members' names.

8.2 Term project

  1. For the latter part of the course, most of your homework time will be spent on a term project.

  2. You are encouraged to do it in teams of any size. A team of 3 people would be expected to do twice as much work as 1 person.

  3. You may combine this with work for another course, provided that both courses know about this and agree. I always agree.

  4. If you are a grad student, you may combine this with your research, if your prof agrees, and you tell me.

  5. You may build on existing work, either your own or others'. You have to say what's new, and have the right to use the other work. E.g., using any GPLed code or any code on my website is automatically allowable (because of my Creative Commons licence).

  6. You will implement, demonstrate, and document something vaguely related to parallel computing.

  7. Deliverables:

    1. An implementation showing parallel computing.

    2. An extended abstract or paper on your project, written up like a paper. You should follow the style guide for some major conference (I don't care which, but can point you to one).

    3. A more detailed manual, showing how to use it.

    4. A 2-minute project proposal given to the class around the middle of the semester.

    5. A 10-minute project presentation and demo given to the class in the last week.

    6. Some progress reports.

    7. A write-up uploaded on the last class day. This will contain an academic paper, code and perhaps video or user manual.

  8. Size

    It's impossible to specify how many lines of code makes a good term project. E.g., I take pride in writing code that can be simultaneously shorter, more robust, and faster than some others. See my 8-line program for testing whether a point is in a polygon: Pnpoly.

    According to Big Blues, when Bill Gates was collaborating with IBM around 1980, he once rewrote a code fragment to be shorter. However, according to the IBM metric, number of lines of code produced, he had just caused that unit to officially do negative work.

9 Remote class procedures

Parallel will be completely virtual. I intend to lecture live, with some supplementary videos. Since this is a small class, students will be encouraged to ask questions, live and with chat during the class. I'll try to save the chat window, but it may be deleted after each class. If you cannot watch the class live, then you may miss any info presented in the chat window.

Try to share your video when talking.

I'll try to use a webex teams space for discussions between classes. The operative word is TRY. We may switch to webex teams for classes.

I will maintain a class blog that briefly summarizes each class. Important points will be written down.

I will attempt to record all classes and upload them to mediasite for students to watch or download later. However, last fall about 10% of classes failed to record properly. If this happens, then you will miss that class.

Depending on the class composition, there might be graded in-class quizzes, since they increase learning. That means that I would like to do this. However, if it causes too much trouble for too many students, then I won't.

There will be no exams. Students will be expected to make in-class presentations, answer questions, and present a final project in class. Just submitting a video for me to play is insufficient, because then how do you answer questions after your talk?

In general, if you have a conflicting class that is not in ECSE, then I expect you to treat my class more seriously, e.g., attend it live more often.

9.1 Reporting problems during a lecture

I don't know what you actually see (even though I have a 2nd laptop showing the meeting). So, I'll give you a phone number to call me during the class to report problems.

10 Early warning system (EWS)

As required by the Provost, we may post notes about you to EWS, for example, if you're having trouble doing homeworks on time, or miss an exam. E.g., if you tell me that you had to miss a class because of family problems, then I may forward that information to the Dean of Students office.

11 Academic integrity

See the Student Handbook for the general policy. The summary is that students and faculty have to trust each other. After you graduate, your most important possession will be your reputation.

Specifics for this course are as follows.

  1. You may collaborate on homeworks, but each team of 1 or 2 people must write up the solution separately (one writeup per team) using their own words. We willingly give hints to anyone who asks.

  2. The penalty for two teams handing in identical work is a zero for both.

  3. You may collaborate in teams of up to 3 people for the term project.

  4. You may get help from anyone for the term project. You may build on a previous project, either your own or someone else's. However you must describe and acknowledge any other work you use, and have the other person's permission, which may be implicit. E.g., my web site gives a blanket permission to use it for nonprofit research or teaching. You must add something creative to the previous work. You must write up the project on your own.

  5. However, writing assistance from the Writing Center and similar sources is allowed, if you acknowledge it.

  6. The penalty for plagiarism is a zero grade.

  7. Cheating will be reported to the Dean of Students Office.

12 Other rules required by RPI

You've seen them in your other classes. Assume that they're included here.

13 Student feedback

Since it's my desire to give you the best possible course in a topic I enjoy teaching, I welcome feedback during (and after) the semester. You may tell me or write me, or contact a third party, such as Prof James Lu, the ECSE undergrad head, or Prof John Wen, the ECSE Dept head.