PAR Class 26, Mon 2020-04-27

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-04-26 00:00

Table of contents

1 Final project presentations

Joe C & Alex Z
Mike M
Matt R & Skyl S
Eliz C
John F & Hayl R & Mish S
Ross D
Clar S & Garr D
Liz C
Zhep L

2 Final project and prior work

If your final project is building on, or sharing with, another course or project (say on GitHub), then you must give the details, and say what's new for this course.

3 After the semester

I'm open to questions and discussions about any legal ethical topic. Even after you graduate.

PAR Class 25, Thurs 2020-04-23

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-04-21 00:00

Table of contents

1 Final paper format

Try to use the IEEE conference format . It allows either latex or MS word. Submit the PDF paper to gradescope.

3 Inspiration for finishing your term projects

The Underhanded C Contest

"The goal of the contest is to write code that is as readable, clear, innocent and straightforward as possible, and yet it must fail to perform at its apparent function. To be more specific, it should do something subtly evil. Every year, we will propose a challenge to coders to solve a simple data processing problem, but with covert malicious behavior. Examples include miscounting votes, shaving money from financial transactions, or leaking information to an eavesdropper. The main goal, however, is to write source code that easily passes visual inspection by other programmers."
The International Obfuscated C Code Contest
https://www.awesomestories.com/asset/view/Space-Race-American-Rocket-Failures

Moral: After early disasters, sometimes you can eventually get things to work.
The 'Wrong' Brothers Aviation's Failures (1920s)
Early U.S. rocket and space launch failures and explosion
Numerous US Launch Failures

4 Software tips

4.1 Git

Git is good to simultaneously keep various versions. Here's a git intro:

Create a dir for the project:

mkdir PROJECT; cd PROJECT

Initialize:

git init

Create a branch (you can do this several times):

git branch MYBRANCHNAME

Go to a branch:

git checkout MYBRANCHNAME

Do things:

vi, make, ....

Save it:

git add .; git commit -mCOMMENT

Repeat

At times I've used this to modify a program for class while keeping a copy of the original.

4.2 Freeze decisions early: SW design paradigm

One of my rules is to push design decisions to take effect as early in the process execution as possible. Constructing variables at compile time is best, at function call time is 2nd, and on the heap is worst.

If I have to construct variables on the heap, I construct few and large variables, never many small ones.
Often I compile the max dataset size into the program, which permits constructing the arrays at compile time. Recompiling for a larger dataset is quick (unless you're using CUDA).

Accessing this type of variable uses one less level of pointer than accessing a variable on the heap. I don't know whether this is faster with a good optimizing compiler, but it's probably not slower.
If the data will require a dataset with unpredictably sized components, such as a ragged array, then I may do the following.
1. Read the data once to accumulate the necessary statistics.
2. Construct the required ragged array.
3. Reread the data and populate the array.

Update: However, with CUDA, maybe managed variables must be on the heap.

5 CPPCON

CppCon is the annual, week-long face-to-face gathering for the entire C++ community. https://cppcon.org/

https://cppcon2018.sched.com/

CppCon 2014: Herb Sutter "Paying for Lunch: C++ in the ManyCore Age" https://www.youtube.com/watch?v=AfI_0GzLWQ8

CppCon 2016: Combine Lambdas and weak_ptrs to make concurrency easy (4min) https://www.youtube.com/watch?v=fEnnmpdZllQ

A Pragmatic Introduction to Multicore Synchronization by Samy Al Bahra. https://www.youtube.com/watch?v=LX4ugnzwggg

CppCon 2018: Tsung-Wei Huang “Fast Parallel Programming using Modern C++” https://www.youtube.com/watch?v=ho9bqIJkvkc&list=PLHTh1InhhwT7GoW7bjOEe2EjyOTkm6Mgd&index=13

CppCon 2018: Anny Gakhokidze “Workflow hacks for developers” https://www.youtube.com/watch?v=K4XxeB1Duyo&list=PLHTh1InhhwT7GoW7bjOEe2EjyOTkm6Mgd&index=33

CppCon 2018: Bjarne Stroustrup “Concepts: The Future of Generic Programming (the future is here)” https://www.youtube.com/watch?v=HddFGPTAmtU

CppCon 2018: Herb Sutter “Thoughts on a more powerful and simpler C++ (5 of N)” https://www.youtube.com/watch?v=80BZxujhY38

CppCon 2018: Jefferson Amstutz “Compute More in Less Time Using C++ Simd Wrapper Libraries” https://www.youtube.com/watch?v=80BZxujhY38

CppCon 2018: Geoffrey Romer “What do you mean "thread-safe"?” https://www.youtube.com/watch?v=s5PCh_FaMfM

6 Quantum computing ctd

Go Behind-the-Scenes of a Quantum Experiment (2:10) https://quantumexperience.ng.bluemix.net/qx/community/question?questionId=5ae975690f020500399ed39a&channel=videos

or

https://www.youtube.com/watch?v=tfZpJLdkzRU&list=PLOFEBzvs-VvpzQnlazij7cL1mjKvJTAwk&index=7

A Qubit in the Making (2:01)

https://www.youtube.com/watch?v=2pB87H3_F_c&list=PLOFEBzvs-VvpzQnlazij7cL1mjKvJTAwk&index=10&t=0s

Behold the Mighty Qubit (2:51) https://www.youtube.com/watch?v=_P7K8jUbLU0&list=PLOFEBzvs-VvpzQnlazij7cL1mjKvJTAwk&index=10

Classical and Quantum Randomness (3:39) https://www.youtube.com/watch?v=8kyJfAC4VAo&list=PLOFEBzvs-VvpzQnlazij7cL1mjKvJTAwk&index=6

Quantum Entanglement (2:21) https://www.youtube.com/watch?v=RmXasxLm43k&list=PLOFEBzvs-VvpzQnlazij7cL1mjKvJTAwk&index=5

Benchmarking Quantum Systems (1:58) https://www.youtube.com/watch?v=-7L5o-mzLqU&list=PLOFEBzvs-VvpzQnlazij7cL1mjKvJTAwk&index=8

7 Supercomputing 2018 (SC18) videos

SC18 Invited Talk: Matthias Troyer https://www.youtube.com/watch?v=97s3FEhG14g (45:46)

This is very nice. I'll show the start and then a highlight, starting at 19:00.

SC18: NVIDIA CEO Jensen Huang on the New HPC (1:44:46) https://www.youtube.com/watch?v=PQbhxfRH2H4

We'll watch the wrapup,

SCinet Must See: How to Build the World’s Fastest Temporary Computer Network Time Lapse (0:57) https://sc18.supercomputing.org/scinet-must-see-how-to-build-the-worlds-fastest-temporary-computer-network-time-lapse/

8 Jack Dongarra

He has visited RPI more than once.

Algorithms for future emerging technologies (2:55:45) https://www.youtube.com/watch?v=TCgHNMezmZ8

We'll watch the start. You're encouraged to watch it all. It gets more technical later.

Stony Brook talk https://www.youtube.com/watch?v=D1hfrtoVZDo

9 Course survey

I see my role as a curator, selecting the best stuff to present to you. Since I like the topic, and asked to be given permission to create this course, I pick things I like since you might like them too.

So, if you liked the course, then please officially tell RPI by completing the survey and saying what you think. Thanks.

PAR Class 24, Mon 2020-04-20

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-04-18 00:00

Table of contents

1 Exascale computing

https://www.nextplatform.com/2019/12/04/openacc-cozies-up-to-c-c-and-fortran-standards/

https://www.nextplatform.com/2019/01/09/two-thirds-of-the-way-home-with-exascale-programming/

They agree with me that fewer bigger parallel processors are better than many smaller processors connected with MPI. I.e., they like my priorities in this course.

2 Final project presentation

Sava C & Davi P & Emil V

3 Quantum computing ctd: Shor's algorithm

Factorize an int.
in BQP.
almost exponentially faster than best classical algorithm.
When I searched for the largest example, I found several inconsistent announcements in the last year or two.
1. 1005973. There is some disagreement about D-wave machines.
2. 21
3. 35
4. 56153 = 233 × 241.
5. https://medium.com/@aditya.yadav/rsa-2048-cracked-using-shors-algorithm-on-a-quantum-computer-660cb2297a95
Interesting about factoring:

3.1 Youtube

27 Quantum Mechanics - Measurement 8:07

Shor on, what is Shor's factoring algorithm? (2:09) https://www.youtube.com/watch?v=hOlOY7NyMfs

Hacking at Quantum Speed with Shor's Algorithm (16:35) https://www.pbs.org/video/hacking-at-quantum-speed-with-shors-algorithm-8jrjkq/

43 Quantum Mechanics - Quantum factoring Period finding (19:27) https://www.youtube.com/watch?v=crMM0tCboZU

44 Quantum Mechanics - Quantum factoring Shor's factoring algorithm (25:42) https://www.youtube.com/watch?v=YhjKWAMFBUU

PAR Class 23, Thurs 2020-04-16

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-04-16 00:00

Table of contents

1 Final projects

1.1 Presentations

Mon April 20
1. Sava C & Davi P & Emil V
Thurs April 23
1. Kevi M
2. Liz C
3. Chri H
4. Zhep L
Mon April 27
1. Joe C & Alex Z
2. Mike M
3. Matt R & Skyl S
4. Eliz C
5. John F & Hayl R & Mish S
6. Ross D
7. Clar S & Garr D

1.2 Reports

Due last class day, Wed April 29.

See syllabus.

To the 2 students in the grad version: do more work.

2 Quantum computing ctd

2.1 Grover's algorithm etc

Algorithms:
1. Some, but not all, are faster.
2. Bounded-error quantum polynomial time (BQP)
  1. "is the class of decision problems solvable by a quantum computer in polynomial time, with an error probability of at most 1/3 for all instances" - https://en.wikipedia.org/wiki/BQP
  2. Includes integer factorization and discrete log.
  3. Relation to NP is unknown (big unsolved problem).
3. Grover's algorithm:
  1. https://en.wikipedia.org/wiki/Grover%27s_algorithm
  2. Given a black box with N inputs and 1 output.
  3. Exactly one input makes the output 1.
  4. Problem: which one?
  5. Classical solution: Try each input, T=N.
  6. Quantum: $T=\sqrt(N)$.
  7. Probabilistic.
  8. Apps: mean, median, reverse a crypto hash, find collisions, generate false blocks.
  9. Can extend to quantum partial search.
  10. Grover's algorithm is optimal.
  11. This suggests that NP is not in BQP .

2.2 Quantum computing on youtube

Quantum Algorithms (2:52)

Which problems can quantum computers solve exponentially faster than classical computers? David Gosset, IBM quantum computing research scientist, explains why algorithms are key to finding out.
David Deutsch - Why is the Quantum so Strange? (8:43)
Grover's Algorithm (9:57)

An overview of Grover's Algorithm. An unstructured search algorithm that can find an item in a list much faster than a classical computer can. Several sources are listed.
Bob Sutor demonstrates the IBM Q quantum computer (6:53)
Can we make quantum technology work? | Leo Kouwenhoven | TEDxAmsterdam (18:19)
"Spooky" physics | Leo Kouwenhoven | TEDxDelft (18:00)
Introduction to Quantum Computing (18) - Grover's Algorithm: Quantum Problem Statement (4:24)
Introduction to Quantum Computing (19) - Grover's Algorithm: Outline (16:09)
Introduction to Quantum Computing (20) - Grover's Algorithm: Subspace (7:42)
Introduction to Quantum Computing (21) - Grover's Algorithm: Geometric Interpretation
Introduction to Quantum Computing (23) - Grover's Algorithm: Implementation (9:33)
Introduction to Quantum Computing (24) - Grover's Algorithm: IBM Quantum Experience (11:39)

PAR Class 22, Mon 2020-04-13

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-04-13 00:00

Table of contents

1 Final project presentations

Anyone who did not sign up is signed up for Thurs April 23.

2 parallel.ecse status

I had to reinstall linux and the packages. If something is missing, please tell me. /opt, /local and your accounts should be the same. Cuda is now up to version 10.2.

The problem was a failed update that resulted in the boot image being damaged.

3 Daily joke

Smart: Guy Creates An AI Clone Of Himself To Sit In On Zoom Meetings So He Doesn't Have To

4 Quantum computing ctd

PAR Class 21, Thu 2020-04-09

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-04-06 00:00

Table of contents

1 Final project presentations
2 Thrust
- 2.1 Backends
3 Quantum computing
- 3.1 Extra material

1 Final project presentations

Remember to sign up.

2 Thrust

The Thrust examples teach several non-intuitive paradigms. As I figure them out, I'll describe a few. My descriptions are modified and expanded versions of the comments in the programs. This is not a list of all the useful programs, but only of some where I am adding to their comments.
1. arbitrary_transformation.cu and dot_products_with_zip.cu. show the very useful zip_iterator. Using it is a 2-step process.
  1. Combine the separate iterators into a tuple.
  2. Construct a zip iterator from the tuple.
  Note that operator() is now a template.
2. boundingbox.cu finds the bounding box around a set of 2D points.
  
  The main idea is to do a reduce. However, the combining operation, instead of addition, is to combine two bounding boxes to find the box around them.
  
  The combining op can be any associative op.
3. bucket_sort2d.cu overlays a grid on a set of 2D points and finds the points in each grid cell (bucket).
  1. The tuple is an efficient class for a short vector of fixed length.
  2. Note how random numbers are generated. You combine an engine that produces random output with a distribution.
    
    However you might need more complicated coding to make the numbers good when executing in parallel. See monte_carlo_disjoint_sequences.cu.
  3. The problem is that the number of points in each cell is unpredictable.
  4. The cell containing each point is computed and that and the points are sorted to bring together the points in each cell.
  5. Then lower_bound and upper_bound are used to find each bucket in that sorted vector of points.
  6. See this lower_bound description.
4. mode.cu shows:
  1. Counting the number of unique keys in a vector.
    1. Sort the vector.
    2. Do an inner_product. However, instead of the operators being times and plus, they are not equal to the next element and plus.
  2. Counting their multiplicity.
    1. Construct vectors, sized at the number of unique keys, to hold the unique keys and counts.
    2. Do a reduce_by_keys on a constant_iterator using the sorted vector as the keys. For each range of identical keys, it sums the constant_iterator. That is, it counts the number of identical keys.
    3. Write a vector of unique keys and a vector of the counts.
  3. Finding the most used key (the mode).
    1. Do max_element on the counts vector.
5. repeated_range.cu repeats each element of an N-vector K times: repeated_range([0, 1, 2, 3], 2) -> [0, 0, 1, 1, 2, 2, 3, 3]. It's a lite version of expand.cu, but uses a different technique.
  1. Here, N=4 and K=2.
  2. The idea is to construct a new iterator, repeated_range, that, when read and incremented, will return the proper output elements.
  3. The construction stores the relevant info in structure components of the variable.
  4. Treating its value like a subscript in the range [0,N*K), it divides that value by K and returns that element of its input.
  See also strided_range.cu and tiled_range.cu.

2.1 Backends

The Thrust device can be CUDA, OpenMP, TBB, etc.
You can spec it in 2 ways:
1. by adding an extra arg at the start of a function's arg list.
2. with an envar
  
  https://github.com/thrust/thrust/wiki/Host-Backends
  
  https://github.com/thrust/thrust/wiki/Device-Backends

3 Quantum computing

My summary.

3.1 Extra material

Watch A Beginner’s Guide to Quantum Computing, 18 min, by Dr. Talia Gershon, IBM Research.
Wikipedia qubit - don't read.
Readable https://en.wikipedia.org/wiki/Quantum_computing
https://qiskit.org/
https://github.com/Qiskit/qiskit-presentations.git

PAR Class 20, Mon 2020-04-06

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-04-05 00:00

Table of contents

1 Final project presentations

12 minute presentations.
Do on last 3 classes, Mon Apr 20, Thurs 23, Mon 27.
Up to 7 a class; FCFS.
Sign up on doodle; I sent everyone invitations.

2 Thrust ctd

2.1 Alternate docs in /parallel-class/thrust/doc/

GTC_2010_Part_2_Thrust_By_Example.pdf

Start at slide 27.

2.2 Examples

The point of these examples is to show you some techniques for parallel programming that are not obvious. I.e., you probably wouldn't think of them if you did not already know them.

I rewrote /parallel-class/thrust/examples-1.8/tiled_range.cu into /parallel-class/thrust/rpi/tiled_range2.cu .

It is now much shorter and much clearer. All the work is done here: ..

gather(make_transform_iterator(make_counting_iterator(0), _1%N), make_transform_iterator(make_counting_iterator(N*C), _1%N), data.begin(), V.begin());
1. make_counting_iterator(0) returns pointers to the sequence 0, 1, 2, ...
2. _1%N is a function computing modulo N.
3. make_transform_iterator(make_counting_iterator(0), _1%N) returns pointers to the sequence 0%N, 1%N, ...
4. gather populates V. The i-th element of V gets make_transform_iterator...+i element of data, i.e., the i%N-th element of data.
tiled_range3.cu is even shorter. Instead of writing an output vector, it constructs an iterator for a virtual output vector: ..

auto output=make_permutation_iterator(data, make_transform_iterator(make_counting_iterator(0), _1%N));
1. *(output+i) is *(data+(i%N)).
2. You can get as many tiles as you want by iterating.
3. tiled_range3.cu also constructs an iterator for a virtual input vector (in this case a vector of squares) instead of storing the data:
auto data = make_transform_iterator(make_counting_iterator(0), _1*_1);
tiled_range5.cu shows how to use a lambda instead of the _1 notation:

auto output=make_permutation_iterator(data, make_transform_iterator(make_counting_iterator(0), [](const int i){return i%N;} ));
1. You have to compile with --std c++11 .
2. This can be rewritten thus: ..
  
  auto f = [](const int i){return i%N;}; auto output = make_permutation_iterator(data, make_transform_iterator(make_counting_iterator(0), f));
3. The shortest lambda is this:
  
  auto f = [](){};
repeated_range2.cu is my improvement on repeated_range.cu: ..

auto output=make_permutation_iterator(data.begin(), make_transform_iterator(make_counting_iterator(0), _1/3));
1. make_transform_iterator(make_counting_iterator(0), _1/3)) returns pointers to the sequence 0,0,0,1,1,1,2,2,2, ...
Unmodified thrust examples:
1. expand.cu takes a vector like V= [0, 10, 20, 30, 40] and a vector of repetition counts, like C= [2, 1, 0, 3, 1]. Expand repeats each element of V the appropriate number of times, giving [0, 0, 10, 30, 30, 30, 40]. The process is as follows.
  1. Since the output vector will be longer than the input, the main program computes the output size, by reduce summing C, and constructs a vector to hold the output.
  2. Exclusive_scan C to obtain output offsets for each input element: C2 = [0, 2, 3, 3, 6].
  3. Scatter_if the nonzero counts into their corresponding output positions. A counting iterator, [0, 1, 2, 3, 4] is mapped with C2, using C as the stencil, giving C3 = [0, 0, 1, 3, 0, 0, 4].
  4. An inclusive_scan with max fills in the holes in C3, to give C4 = [0, 0, 1, 3, 3, 3, 4].
  5. Gather uses C4 to gather elements of V: [0, 0, 10, 30, 30, 30, 40].
2. set_operations.cu. This shows methods of handling an operation whose output is of unpredictable size. The question is, is space or time more important?
  1. If the maximum possible output size is reasonable, then construct an output vector of that size, use it, and then erase it down to its actual size.
  2. Or, run the operation twice. The 1st time, write to a discard_iterator, and remember only the size of the written data. Then, construct an output vector of exactly the right size, and run the operation again.
    
    I use this technique a lot with ragged arrays in sequential programs.
3. sparse_vector.cu represents and sums sparse vectors.
  1. A sparse vector has mostly 0s.
  2. The representation is a vector of element indices and another vector of values.
  3. Adding two sparse vectors goes as follows.
    1. Allocate temporary index and element vectors of the max possible size (the sum of the sizes of the two inputs).
    2. Catenate the input vectors.
    3. Sort by index.
    4. Find the number of unique indices by applying inner_product with addition and not-equal-to-next-element to the indices, then adding one.
      
      E.g., applied to these indices: [0, 3, 3, 4, 5, 5, 5, 8], it gives 5.
    5. Allocate exactly enough space for the output.
    6. Apply reduce_by_key to the indices and elements to add elements with the same keys.
      
      The size of the output is the number of unique keys.
What's the best way to sort 16000 sets of 1000 numbers each? E.g., sort the rows of a 16000x1000 array? On geoxeon, /pc/thrust/rpi/comparesorts.cu, which I copied from http://stackoverflow.com/questions/28150098/how-to-use-thrust-to-sort-the-rows-of-a-matrix|stackoverflow, compares three methods.
Call the thrust sort 16000 times, once per set. That took 10 secs.
Sort the whole list of 16,000,000 numbers together. Then sort it again by key, with the keys being the set number, to bring the elements of each set together. Since the sort is stable, this maintains the order within each set. (This is also how radix sort works.) That took 0.04 secs.
Call a thrust function (to sort a set) within another thrust function (that applies to each set). This is new in Thrust 1.8. That took 0.3 secs.

This is a surprising and useful paradigm. It works because

There's an overhead to starting each thrust function, and
Radix sort, which thrust uses for ints, takes linear time.

PAR Homework 7 - project status report due 2020-04-09

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-03-31 00:00

This homework is a status report on the term project.

Please include specific info about the parallel computing component.

Total: 10 pts.

PAR quantum intro, Wed 2020-03-25

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-03-24 00:00

Table of contents

1 This file is online
2 Viewpoint of this presentation
3 Entanglement
4 Classical computation
5 Quantum computation
6 My RPI course
7 IBM quantum computing
8 Quantum computing on youtube
- 8.1 Grover's algorithm
- 8.2 Shor's algorithm
- 8.3 IBM videos
- 8.4 Others

1 This file is online

https://wrf.ecse.rpi.edu/Teaching/parallel-s2020/posts/quantum.html

though that might change.

2 Viewpoint of this presentation

I don't care how qbits are implemented. That's important, even necessary, but it's someone else's problem.
This recapitulates the development of quantum computing, where the theory was worked out starting in the 1980s, long before actual quantum computers were built.

(Analogously, the theory for classical computers was worked out in the 1930s, long before real computers were built. Alan Turing contributed to both the theory and the practice.)
This presentation may have many errors, but I hope that the general tenor is not too misleading.

3 Entanglement

warmup

Classical metaphor for entanglement:

Start with a piece of paper.
Tear it into two halves.
Put each half into an envelope, seal them, and mix them up, so that you can't tell which half is in which envelope.
Address and mail one envelope to a friend in Australia, and the other to a friend in Greenland.
When the Australian opens his envelope, he knows what the Greenlander will find in his.
However that doesn't let the Australian send any info to the Greenlander, or vv.
This has been demonstrated with real qbits transported 1000 miles apart.

Technical details later.

4 Classical computation

Bit.
Its 2 possible values are 0 or 1.
Byte.
Has 8 bits.
It has 256 possible values.
Bits are transformed with gates, like nand, nor, and, or, xor, not, ...
They generally destroy info, and are not invertible.
More complex circuits, like adders, are formed from a combo of these gates.

5 Quantum computation

Qbit, $q$.
Its state is a linear combo of two basis states, $|0>$ and $|1>$:

$q = a|0> + b|1>$ ,

where $a$ and $b$ are complex numbers, and $ | a | ^2 + | b | ^2 = 1$.
IOW, its state is a superposition of those two basis states, with those weights.
It is wrong to think that $q$ is really in one of the two states, but you don't know which one. This is the hidden variable theory. It has been proved experimentally to be false.
$q$ is really in both states simultaneously.

Alice laughed. "There's no use trying," she said: "one can't believe impossible things." "I daresay you haven't had much practice," said the Queen. "When I was your age, I always did it for half-an-hour a day. Why, sometimes I've believed as many as six impossible things before breakfast." - Through the Looking-Glass, and What Alice Found There (1871), by Lewis Carroll (Charles Lutwidge Dodgson).
You cannot observe its state, unless it is $|0>$ and $|1>$, in which case you observe $0$ or $1$. This is the classical case.
Otherwise you observe it with a measurement operator that transforms it to either $|0>$ and $|1>$, with probabilities

$| a | ^2$ and $| b | ^2 = 1$, respectively.
$a$ and $b$ are complex.
That measurement changes $q$; it no longer has its old value.
$q$, that is, $q$ 's value, can be considered to be a vector of length one: $$\begin{pmatrix} a | 0> \\ b | 1> \end{pmatrix} $$ or simply $$\begin{pmatrix}a\\b\end{pmatrix}$$.
You operate on $q$ with a matrix multiplication: $q_2 = M q$.
This changes $q$; the old value is no longer available.
No cloning: You cannot copy a qbit, but can move it.
The life cycle of a qbit:
1. Create a qbit with a classical value, 0 or 1.
2. Operate on it with matrices, which rotate it to have complex weights.
3. Transform it back to 0 or 1 and read it.
So far, not very powerful.
Now, let $q$ be a system with two qbits, i.e., a 2-vector of qbits.
$q$ is now a linear combo of 4 basis values, $ | 00>$, $ | 01>$, $ | 10>$, $ | 11>$.
$q = a_0 | 00> + a_1 | 01> + a_2 | 10> + a_3 | 11> $
where $a_i$ are complex and $ \sum | a_i | ^2 = 1$.
$q$ exists in all 4 states simultaneously.
If $q$ is a vector with n component qbits, then it exists in $2^n$ states simultaneously.
This is part of the reason that quantum computation is powerful.
Measuring $q$ will collapse it to one of {00, 01, 10, 11} with probabilities $ | a_i | ^2$.
You operate on $q$ by multiplying it by a 4x4 matrix operator.
The matrices are all invertible, and all leave $ | q | = 1$.
You set the initial value of $q$ by setting its two qbits each to 0 or 1.
How this is done depends on the particular hw.
I.e., initially, $q_1 = \begin{pmatrix}a_1 | 0> \\b_1 | 1> \end{pmatrix}$ and $q_2 = \begin{pmatrix}a_2 | 0> \\b_2 | 1> \end{pmatrix}$, and so

$$q = \begin{pmatrix} q_1 \\ q_2 \end{pmatrix} = \begin{pmatrix} a_1 a_2 | 00 > \\ a_1 b_2 | 01 > \\ a_2 b_1 | 10 > \\ b_1 b_2 | 11 > \end{pmatrix}$$.
Here, the combined state is the tensor product of the individual qbits.
For $n$ qbits, the tensor product is a vector with $2^n$ elements, one element for each possible value of each qbit.
Each element of the tensor product has a complex weight.
You transform a state by multiplying it by a matrix.
The matrix is invertible.
The transformation doesn't destroy information.
When you measure a state, it collapses into one of the component states. (This may be inaccurate.)
You don't need to bring in consciousness etc. The collapse happens because the measurement causes the state to interact with the outside world.
The probability of collapsing into a particular state is the squared magnitude of its complex weight.
For some sets of weights, particularly after a transformation, the combined state cannot be separated into a tensor product of individual qbits. In this case, the individual qbits are entangled.
That is the next part of why quantum computation is powerful.
Entanglement means that if you measure one qbit then what you observe restricts what would be observed when you measure the other qbit.
However that does not let you communicate.
From page 171 of

Quantum Computing for Computer Scientists 1st Edition

All quantum algorithms work with the following basic framework:
1. The system will start with the qubits in a particular classical state.
2. From there the system is put into a superposition of many states.
3. This is followed by acting on this superposition with several unitary operations.
4. And finally, a measurement of the qubits.
1. Ways to look at measurement:
  1. Converts qbit to classical bit.
  2. Is an interaction with the external world.
  3. Information is not lost, but leaks into the external world.
  4. Is an operator represented by a matrix.
  5. that is Hermitian, i.e., it's equal to its complex transpose.
  6. For physical systems, some operators compute the system's momentum, position, or energy.
  7. The matrix's eigenvalues are real.
  8. The result of the operation is one of the eigenvalues.
More from

Quantum Computing for Computer Scientists 1st Edition
1. Compare byte with qbyte.
2. State of byte is 8 bits.
3. State of qbyte is 256 complex weights.
4. They all get modified by each operation (aka matrix multiply).
5. That is the power of quantum computing.
The current limitations are that IBM does only a few qbits and that the operation is noisy.
https://en.m.wikipedia.org/wiki/Bloch_sphere:
1. The state of a qbit can be represented as a point on or in a sphere of radius 1.
2. E.g., |1> is the north pole, |0> the south pole.
3. Many operations are rotations.
Common operations (aka gates):

https://en.wikipedia.org/wiki/Quantum_logic_gate
1. Hadamard.
  
  Creates a superposition.
  
  https://cs.stackexchange.com/questions/63859/intuition-behind-the-hadamard-gate
2. Toffoli, aka CCNOT.
  
  Universal for classical boolean functions.
  
  (a,b,c) -> (a,b, c xor (a and b))
  
  https://en.wikipedia.org/wiki/Toffoli_gate
3. Fredkin, aka CSWAP.
  
  https://en.wikipedia.org/wiki/Fredkin_gate
  
  3 inputs. Swaps last 2 if first is true.
  
  sample app: 5 gates make a 3-bit full adder.
Algorithms:
1. Some, but not all, are faster.
2. Bounded-error quantum polynomial time (BQP)
  1. "is the class of decision problems solvable by a quantum computer in polynomial time, with an error probability of at most 1/3 for all instances" - https://en.wikipedia.org/wiki/BQP
  2. Includes integer factorization and discrete log.
  3. Relation to NP is unknown (big unsolved problem).
3. Searching problems:
  1. Find the answer to a puzzle.
  2. Math examples: factor an integer, solve a polynonial equation.
  3. Testing validity of a putative solution is easy.
  4. Finding that putative solution, naively, requires testing all possibilities.
  5. Quantum computation can solve some searching problems faster.
  6. This is probabilistic or noisy; often the found solution is wrong.
  7. So you repeat the computation enough times that the error rate is acceptably low.
  8. Some classical algorithms are similar. There is an excellent probabilistic primality algorithm.
  9. The quantum algorithms are quite complex. (i.e., I'm still learning them.)
4. Algorithms, another view
  1. Hadamard matrix rotates the pure state to an entangled superposition.
  2. Then we operate in parallel on each state in the superposition.
  3. Finally we separate out the desired answer.
5. Grover's algorithm:
  1. https://en.wikipedia.org/wiki/Grover%27s_algorithm
  2. Given a black box with N inputs and 1 output.
  3. Exactly one input makes the output 1.
  4. Problem: which one?
  5. Classical solution: Try each input, T=N.
  6. Quantum: $T=\sqrt(N)$.
  7. Probabilistic.
  8. Apps: mean, median, reverse a crypto hash, find collisions, generate false blocks.
  9. Can extend to quantum partial search.
  10. Grover's algorithm is optimal.
  11. This suggests that NP is not in BQP .
6. Shor's algorithm:
  1. Factorize an integer.
  2. in BQP.
  3. almost exponentially faster than best classical algorithm.
  4. Largest examples I can find:
    1. 56153 = 233 × 241.
    2. https://medium.com/@aditya.yadav/rsa-2048-cracked-using-shors-algorithm-on-a-quantum-computer-660cb2297a95
https://quantumexperience.ng.bluemix.net/qx/tutorial ctd.

6 My RPI course

I've created ECSE-4964 Quantum Computer Programming, CRN 30195. Preliminary syllabus.

This will force me to learn the material.

As of 3/25/2020, 17 students have preregistered.

7 IBM quantum computing

They have several quantum computers.
The older ones are freely available on the web.
Those have 5 and 14 qbits.
You submit a batch job and get emailed when it runs.
IBM github site: https://github.com/Qiskit with
1. a free simulator.
  
  It doesn't match all the physical complexity of the real computer, but it's a good start.
2. and tutorials and presentations:
and a SW development framework. https://qiskit.org/
You can create a quantum computation program either by

designing a circuit, or
using a programming language.

8 Quantum computing on youtube

Quantum Algorithms (2:52)

Which problems can quantum computers solve exponentially faster than classical computers? David Gosset, IBM quantum computing research scientist, explains why algorithms are key to finding out.
David Deutsch - Why is the Quantum so Strange? (8:43)
Grover's Algorithm (9:57)

An overview of Grover's Algorithm. An unstructured search algorithm that can find an item in a list much faster than a classical computer can. Several sources are listed.
Bob Sutor demonstrates the IBM Q quantum computer (6:53)
Can we make quantum technology work? | Leo Kouwenhoven | TEDxAmsterdam (18:19)
"Spooky" physics | Leo Kouwenhoven | TEDxDelft (18:00)
David Deutsch - Why is the Quantum so Strange? (8:43)
Quantum Computing Concepts – Quantum Hardware (3:22)
IBM Introduces First Integrated Quantum Computing System for Commercial Use
The World’s First Integrated Quantum Computing System (1:06)

8.1 Grover's algorithm

8.2 Shor's algorithm

Shor on, what is Shor's factoring algorithm? (2:09) https://www.youtube.com/watch?v=hOlOY7NyMfs

Hacking at Quantum Speed with Shor's Algorithm (16:35) https://kcts9.org/programs/digital-studios-infinite-series/episodes/0121

43 Quantum Mechanics - Quantum factoring Period finding (19:27) https://www.youtube.com/watch?v=crMM0tCboZU

44 Quantum Mechanics - Quantum factoring Shor's factoring algorithm (25:42) https://www.youtube.com/watch?v=YhjKWAMFBUU

8.3 IBM videos

These are available on IBM's site (you need to create an account) and on youtube.

Go Behind-the-Scenes of a Quantum Experiment (2:10) https://quantumexperience.ng.bluemix.net/qx/community/question?questionId=5ae975690f020500399ed39a&channel=videos

or

https://www.youtube.com/watch?v=tfZpJLdkzRU&list=PLOFEBzvs-VvpzQnlazij7cL1mjKvJTAwk&index=7

A Qubit in the Making (2:01)

https://www.youtube.com/watch?v=2pB87H3_F_c&list=PLOFEBzvs-VvpzQnlazij7cL1mjKvJTAwk&index=10&t=0s

Behold the Mighty Qubit (2:51) https://www.youtube.com/watch?v=_P7K8jUbLU0&list=PLOFEBzvs-VvpzQnlazij7cL1mjKvJTAwk&index=10

Classical and Quantum Randomness (3:39) https://www.youtube.com/watch?v=8kyJfAC4VAo&list=PLOFEBzvs-VvpzQnlazij7cL1mjKvJTAwk&index=6

Quantum Entanglement (2:21) https://www.youtube.com/watch?v=RmXasxLm43k&list=PLOFEBzvs-VvpzQnlazij7cL1mjKvJTAwk&index=5

Benchmarking Quantum Systems (1:58) https://www.youtube.com/watch?v=-7L5o-mzLqU&list=PLOFEBzvs-VvpzQnlazij7cL1mjKvJTAwk&index=8

8.4 Others

Experiment with Basic Quantum Algorithms (Ali Javadi-Abhari, ISCA 2018) (19:05) https://www.youtube.com/watch?v=M1UHi9UXTWI&list=PLOFEBzvs-VvruANdBhTb-9YRDes07pZWQ&index=2

PAR Class 19, Thu 2020-04-02

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-03-09 00:00

Table of contents

1 Project proposal

To the several people who haven't uploaded it to gradescope: please do.

2 Homework 7 - status report

is online.

3 Thrust

Thrust is an API that looks like STL. Its backend can be CUDA, OpenMP, or sequential host-based code.
The online Thrust directory structure is a mess. Three main sites appear to be these:
1. https://github.com/thrust -
  1. The best way to install it is to clone from here.
  2. The latest version of the examples is also here.
  3. The wiki has a lot of doc.
2. https://thrust.github.io/
  
  This points to the above site.
3. https://developer.nvidia.com/thrust
  
  This has links to other Nvidia docs, some of which are obsolete.
4. http://docs.nvidia.com/cuda/thrust/index.html
  
  easy-to-read, thorough, obsolete, doc
5. https://code.google.com/ - no longer exists.
Functional-programming philosophy.
Many possible backends: host, GPU, OpenMP, TBB...
Easier programming, once you get used to it.
Code is efficient.
Uses some unusual C++ techniques, like overloading operator().
Since the Stanford slides were created, Thrust has adopted unified addressing, so that pointers know whether they are host or device.
On parallel in /parallel-class/thrust/ are many little demo programs from the thrust distribution, with my additions.
CUDACast videos on Thrust:

CUDACast #.15 - Introduction to Thrust

CUDACast #.16 - Thrust Algorithms and Custom Operators
Thrust is fast because the functions that look like they would need linear time really take only log time in parallel.
In functions like reduce and transform, you often see an argument like thrust::multiplies<float>(). The syntax is as follows:
1. thrust::multiplies<float> is a class.
2. It overloads operator().
3. However, in the call to reduce, thrust::multiplies<float>() is calling the default constructor to construct a variable of class thrust::multiplies<float>, and passing it to reduce.
4. reduce will treat its argument as a function name and call it with an argument, triggering operator().
5. You may also create your own variable of that class, e.g., thrust::multiplies<float> foo. Then you just say foo in the argument list, not foo().
6. The optimizing compiler will replace the operator() function call with the defining expression and then continue optimizing. So, there is no overhead, unlike if you passed in a pointer to a function.
Sometimes, e.g., in saxpy.cu, you see saxpy_functor(A).
1. The class saxpy_functor has a constructor taking one argument.
2. saxpy_functor(A) constructs and returns a variable of class saxpy_functor and stores A in the variable.
3. The class also overloads operator().
4. (Let's call the new variable foo). foo() calls operator() for foo; its execution uses the stored A.
5. Effectively, we did a closure of saxpy_functor; this is, we bound a property and returned a new, more restricted, variable or class.

3.1 Bug

I found and reported a bug in version 100904. This version does not work with OpenMP. It was immediately closed because they already knew about it. They just hadn't told us users.

Awhile ago I reported a bug in nvcc, where it went into an infinite loop for a certain array size. The next minor version of CUDA was released the next day.

Two observations:

I'm good at breaking SW by expecting it to meet its specs.
Nvidia is responsive, which I like.

3.2 Stanford's lectures

Continue Stanford's parallel course notes.

Lecture 8: Thrust ctd, starting at slides 24.

There's a lot of stuff here.

3.3 Alternate docs in /parallel-class/thrust/doc/

An_Introduction_To_Thrust.pdf
GTC_2010_Part_2_Thrust_By_Example.pdf

PAR Class 18, Mon 2020-03-30

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-03-08 00:00

Table of contents

1 Ensure that you're in piazza

I use this for quick announcements that are not so much of permanent interest, and for things that don't necessarily need to be on the global internet forever.

I think I added everyone who wasn't already in it, but for complicated reasons might have missed someone.

So, if you're not in it, please add yourself or email me to add you.

I'll send a test message. If you don't get it, check your spam filter. RPI's spam filter recently blocked my own messages to piazza being forwarded to myself.

2 My Quantum Intro blog posting

I prepared it for an internal RPI research discussion last week. RPI faculty in several departments are considering how to do research on this. A group of RPI people, including two vice presidents, visited IBM before the break to talk about this. I'll work this into this course later, but you're welcome to read it now.

3 Linux HMM (Heterogeneous Memory Management)

This is hardware support to make programming devices like GPUs easier. It took years to get into the official kernel, presumably because you don't want bugs in memory management.

This is also a nice intro into current operating systems concerns. Do our operating systems courses talk about this?

4 Several forms of C++ functions

Traditional top level function

auto add(int a, int b) { return a+b;}

You can pass this to a function. This really passes a pointer to the function. It doesn't optimize across the call.
Overload operator() in a new class

Each different variable of the class is a different function. The function can use the variable's value. This is a closure.

This is local to the containing block.

This form optimizes well.
Lambda, or anon function.

auto add = [](int a, int b) { return a+b;};

This is local to the containing block.

This form optimizes well.
Placeholder notation.

As an argument in, e.g., transform, you can do this:

transform(..., _1+_2);

This is nice and short.

As this is implemented by overloading the operators, the syntax of the expression is limited to what was overloaded.

PAR Class 17, Thu 2020-03-26

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-03-07 00:00

Table of contents

1 Mediatesite

The videos of our classes will be in my Mediatesite channel ECSE-4740 Applied Parallel Computing for Engineers.

Please report any problems. I have no easy way to see the system from a student's perspective.

2 Thrust

Stanford's parallel course notes.

Starting with lecture 5, which shows us a little CUDA code and, more important, optimizing techniques and parallel paradigms.

I showed starting from lecture 5 through lecture 8, slide 19.
The github repository, with demo is https://github.com/thrust/thrust.git .

Nvidia's proprietary version is slightly newer.
The most comprehensive doc is online at http://thrust.github.io/doc/index.html

It is badly written and slightly obsolete.
There are various tutorials online, most obsolescent. E.g., they don't use C++-11 lambdas, which are a big help.
Look at some Thrust programs in /parallel-class/cuda/thrust
One Nvidia-sponsored alternative is agency, at https://github.com/agency-library/ .
There are other alternatives that I'll mention later.
The alternatives are lower-level (= faster and harder to use) and newer (= possibly less debugged, fewer users).
However OpenACC now looks competitive.
The biggest problem with Thrust is that it appears that Nvidia has de-emphasized it, and appears to be making it proprietary. The two latest versions of Thrust do not allow Intel as a backend. That is a bad sign, and may be a reason to stop using it. The Thrust developers say that this is temporary but they haven't fixed it in the release version.

PAR Class 16, Mon 2020-03-23

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-03-06 00:00

Table of contents

1 About this unprecedented situation
2 Course computer tools
- 2.1 Webex
- 2.2 Piazza
- 2.3 This blog
- 2.4 Gradescope
- 2.5 parallel.ecse
3 Today
- 3.1 Final round 2 student talks
- 3.2 OpenACC
- 3.3 CUDA

1 About this unprecedented situation

RPI wants to continue to give you a solid education.

We recognize that new problems may arise and will try to be humane.

You will still have to study.

Anything listed here is only a best attempt and might be changed if there's a good reason.

Your feedback is always welcome.

2 Course computer tools

2.1 Webex

For the rest of the semester, the idea is that I'll still lecture at the appointed times, Mon and Thurs at noon eastern time.

However, now I'll try to lecture with Webex. You can access it via your favorite browser. Or, you can use an app on some platforms like the Ipad.

The lectures will be recorded so that you can rewatch them later.

I'll post the link on piazza with email to you before the lecture.

2.2 Piazza

I'll use piazza---

to push announcements to the class,
to post material that might not be completely public,
for questions and answers.

2.3 This blog

This blog will continue to be sort of a permanent record of the class.

2.4 Gradescope

Gradescope will continue to be used for grading.

2.5 parallel.ecse

parallel.ecse will continue to be available remotely.

3 Today

3.1 Final round 2 student talks

Two remaining students will present your talks.

If possible, post your slides in advance or email them to me for posting.

3.2 OpenACC

Quickly finish the 3 slide sets.

From https://www.openacc.org/events/openacc-online-course-2018.

My local copy on parallel: /parallel-class/openacc/online-course

3.3 CUDA

I've mentioned this many times. Now we'll get into details.

Stanford's parallel course notes.

Lecture 5: Finish CUDA,

Lectures 6, 7: Some parallel patterns.

Sample programs.
1. /local/cuda/samples has many programs by Nvidia. Some interesting ones:
  1. /local/cuda/samples/1_Utilities/deviceQuery/deviceQuery describes the host's GPUs.
  2. /local/cuda/samples/0_Simple/ does some benchmarks.
2. Stanfords examples are in parallel-class/stanford/tutorials .

PAR Class 15, Thu 2020-03-05

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-03-05 00:00

Table of contents

1 Student talks, round 2, day 2

Clara
Zhepeng
Misha
Matthew & Skylar
Savannah & Emily
Elizabeth
Mike
Garret

10

2 Next topic

Thrust. Parallel version of STL running on top of CUDA. Shows some surprising parallel map-reduce algorithms.

PAR Class 14, Mon 2020-03-02

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-03-01 00:00

Table of contents

1 Student talks, round 2, day 1

Joseph & Alexander
Kevin
Chris
Ross
David
John & Hayley

8

2 OpenACC slides

To fill in the remaining time.

From https://www.openacc.org/events/openacc-online-course-2018

My local copy on parallel: /parallel-class/openacc/online-course

PAR Homework 6 - talks round 2, Mar 2 & 5, 2020

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-03-01 00:00

This homework is the 2nd round of student talks.

Total: 10 pts.

PAR Class 13, Thu 2020-02-27

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-02-22 00:00

Table of contents

1 Today's reading - intro to quantum computing

1 Today's reading - intro to quantum computing

No lecture.
In preparation for quantum computing, which I'll start in a few weeks, read Quantum Computing for Computer Scientists slides and video

PAR Class 12, Mon 2020-02-24

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-02-21 00:00

Table of contents

1 Term project

See the syllabus.
Proposal due March 12.
Progress reports on March 26, April 9 (Thu)
Presentations in class April 20 (Mon), 23 (Thu), 27 (Mon).
Paper etc due on Wed April 29 (last class).

2 No class Thu Feb 27

A group of us will be visiting IBM's Quantum Computing group.

In preparation for quantum computing, which I'll start in a few weeks, read Quantum Computing for Computer Scientists slides and video

3 Types of memory allocation

Here's a brief overview of my understanding of the various places that you can assign memory in a program.

Static. Define a fixed-size array global array. The variable is constructed at compile time, so accesses might perhaps be faster. Global vars with non default initial values increase the executable file size. If they're large enough, you need to use the compiler option -mcmodel=medium or -mcmodel=large. They cause the compiler to generate wider addresses. I don't know the effect on the program's size or speed, but suspect that it's small.
Stack. Define local arrays, that are created and freed as the routine is entered and exited. Their addresses relative to the base of this call frame may be constant. The default stack size is 8MB. You can increase this with the command ulimit or in the program as shown in stacksize.cc. I believe that in OpenMP, the max stacksize may be allocated when each thread is created. Then, a really big stackssize might have a penalty.
Heap. You use new and destroy. Variables are constructed whenever you want. The more objects on the heap, the more time that each new or destroy takes. If you have lots of objects consider using placement new or creating an array of them.

For CUDA, some variables must be on the heap.

I like to use static, then stack, and heap only when necessary. However, allocating few but large, blocks on the heap is also fast.

Google's allocator is noticeably better than the default one. To use it, link your programs with -ltcmalloc. You can often use it on an existing executable foo thus:

LD_PRELOAD="/usr/lib/libtcmalloc.so" foo

I found it to save 15% to 30% in time.

Another memory concern is speed. Parallel has a NUMA (Non Uniform Memory Architecture). It has two 14-core Xeons. Each core has 128GB of main memory. Although all 256GB are in a common address space, accessing memory on same core as the thread is running on is faster.

The following is what I think based on some research, but may be wrong: A 4KB page of memory is assigned to a specific core when it is first written (not when it is reserved). So, each page of a large array may be on a different core. This can be used to optimize things. This gets more fun with 8-processor systems.

All that is separate from cache issues.

You can also assign your OpenMP threads to specific cores. This affects speed in ways I don't understand. The issues are resource sharing vs conflicts.

4 parallel.ecse hardware details

I put the invoice on parallel.ecse in /parallel-class/ . It gives the hardware specifics.

Since the initial purchase, parallel has been modified:

The Intel MIC coprocessor was removed.
The Nvidia RTX 8000 was added.

5 Nvidia GPU summary

Here's a summary of the Nvidia Pascal GP104 GPU architecture as I understand it. It's more compact than I've found elsewhere. I'll add to it from time to time. Some numbers are probably wrong.

The host is the CPU.
The device is the GPU.
The device contains 20 streaming multiprocessors (SMs).

Different GPU generations have used the terms SMX or SMM.
A thread is a sequential program with private and shared memory, program counter, etc.
Threads are grouped, 32 at a time, into warps.
Warps of threads are grouped into blocks.

Often the warps are only implicit, and we consider that the threads are grouped directly into blocks.

That abstract hides details that may be important; see below.
Blocks of threads are grouped into a grid, which is all the threads in the kernel.
A kernel is a parallel program executing on the device.
1. The kernel runs potentially thousands of threads.
2. A kernel can create other kernels and wait for their completion.
3. There may be a limit, e.g., 5 seconds, on a kernel's run time.
Thread-level resources:
1. Each thread can use up to 255 fast registers. Registers are private to the thread.
  
  All the threads in one block have their registers allocated from a fixed pool of 65536 registers. The more registers that each thread uses, the fewer warps in the block can run simultaneously.
2. Each thread has 512KB slow local memory, allocated from the global memory.
3. Local memory is used when not enough registers are available, and to store thread-local arrays.
Warp-level resources:
1. Threads are grouped, 32 at a time, into warps.
2. Each warp executes as a SIMD, with one instruction register. At each cycle, every thread in a warp is either executing the same instruction, or is disabled. If the 32 threads want to execute 32 different instructions, then they will execute one after the other, sequentially.
  
  If you read in some NVidia doc that threads in a warp run independently, then continue reading the next page to get the info mentioned in the previous paragraph.
3. If successive instructions in a warp do not depend on each other, then, if there are enough warp schedulers available, they may be executed in parallel. This is called Instruction Level Parallelism (ILP).
4. For an array in local memory, which means that each thread will have its private copy, the elements for all the threads in a warp are interleaved to potentially increase the I/O rate.
  
  Therefore your program should try to have successive threads read successive words of arrays.
5. A thread can read variables from other threads in the same warp, with the shuffle instruction. Typical operation are to read from the K-th next thread, to do a butterfly permutation, or to do an indexed read. This happens in parallel for the whole warp, and does not use shared memory.
6. A warp vote combines a bit computed by each thread to report results like all or any.
Block-level resources:
1. A block may contain up to 1024 threads.
2. Each block has access to 65536 fast 32-bit registers, for the use of its threads.
3. Each block can use up to 49152 bytes of the SM's fast shared memory. The block's shared memory is shared by all the threads in the block, but is hidden from other blocks.
  
  Shared memory is basically a user-controllable cache of some global data. The saving comes from reusing that shared data several times after you loaded it from global memory once.
  
  Shared memory is interleaved in banks so that some access patterns are faster than others.
4. Warps in a block run asynchronously and run different instructions. They are scheduled and executed as resources are available.
5. However they are all running the same instruction sequence, perhaps at different points in it.
6. That is call SPMD, single program multiple data.
7. The threads in a block can be synchonized with __syncthreads().
  
  Because of how warps are scheduled, that can be slow.
8. The threads in a block can be arranged into a 3D array, up to 1024x1024x64.
  
  That is for convenience, and does not increase performance (I think).
9. I'll talk about textures later.
Streaming Multiprocessor (SM) - level resources:
1. Each SM has 128 single-precision CUDA cores, 64 double-precision units, 32 special function units, and 32 load/store units.
2. In total, the GPU has 2560 CUDA cores.
3. A CUDA core is akin to an ALU. The cores, and all the units, are pipelined.
4. A CUDA core is much less powerful than one core of an Intel Xeon. My guess is 1/20th.
5. Beware that, in the CUDA C Programming Guide, NVidia sometimes calls an SM a core.
6. The limited number of, e.g., double precision units means that an DP instruction will need to be scheduled several times for all the threads to execute it. That's why DP is slower.
7. Each SM has 4 warp schedulers and 8 instruction dispatch units.
8. 64 warps can simultaneously reside in an SM.
9. Therefore up to 32x64=2048 threads can be executed in parallel by an SM.
10. Up to 16 blocks that can simultaneously be resident in an SM.
  
  However, if each block uses too many resources, like shared memory, then this number is reduced.
  
  Each block sits on only one SM; no block is split. However a block's warps are executed asynchronously (until synced).
11. Each SM has 64KiB (?) fast memory to be divided between shared memory and an L1 cache. Typically, 48KiB (96?) is used for the shared memory, to be divided among its resident blocks, but that can be changed.
12. The 48KB L1 cache can cache local or global memory.
13. Each SM has a read-only data cache of 48KB to cache the global constant memory.
14. Each SM has 8 texture units, and many other graphics capabilities.
15. Each SM has 256KB of L2 cache.
Grid-level resources:
1. The blocks in a grid can be arranged into a 3D array. up to $(2^{31}-1, 2^{16}-1, 2^{16}-1)$.
2. Blocks in a grid might run on different SMs.
3. Blocks in a grid are queued and executed as resources are available, in an unpredictable parallel or serial order. Therefore they should be independent of each other.
4. The number of instructions in a kernel is limited.
5. Any thread can stop the kernel by calling assert.
Device-level resources:
1. There is a large and slow 48GB global memory, which persists from kernel to kernel.
  
  Transactions to global memory are 128 bytes.
  
  Host memory can also be memory-mapped into global memory, although the I/O rate will be lower.
  
  Reading from global memory can take hundreds of cycles. A warp that does this will be paused and another warp started. Such context switching is very efficient. Therefore device throughput stays high, although there is a latency. This is called Thread Level Parallelism (TLP) and is a major reason for GPU performance.
  
  That assumes that an SM has enough active warps that there is always another warp available for execution. That is a reason for having warps that do not use all the resources (registers etc) that they're allowed to.
2. There is a 2MB L2 cache, for sharing data between SMs.
3. There is a 64KiB Small and fast global constant memory, , which also persists from kernel to kernel. It is implemented as a piece of the global memory, made fast with caches.
  
  (Again, I'm still resolving this apparent contradiction).
4. Grid Management Unit (GMU) schedules (pauses, executes, etc) grids on the device. This is more important because grids can start other grids (Dynamic Parallelism).
5. Hyper-Q: 32 simultaneous CPU tasks can launch kernels into the queue; they don't block each other. If one kernel is waiting, another runs.
6. CUDA Work Distributor (CWD) dispatches 32 active grids at a time to the SMs. There may be 1000s of grids queued and waiting.
7. GPU Direct: Other devices can DMA the GPU memory.
8. The base clock is 1607MHz.
9. GFLOPS: 8873.
10. Memory bandwidth: 320GB/s
GPU-level resources:
1. Being a Geforce product, there are many graphics facilities that we're not using.
2. There are 4 Graphics processing clusters (GPCs) to do graphics stuff.
3. Several perspective projections can be computed in parallel, for systems with several displays.
4. There's HW for texture processing.
Generational changes:
1. With each new version, Nvidia tweaks the numbers. Some get higher, others get lower.
  1. E.g., Maxwell had little HW for double precision, and so that was slow.
  2. Pascal's clock speed is much higher.
Refs:
1. The CUDA program deviceDrv.
2. http://developer.download.nvidia.com/compute/cuda/compute-docs/cuda-performance-report.pdf
3. http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf
4. Better Performance at Lower Occupancy, Vasily Volkov, UC Berkeley, 2010.
5. https://www.pgroup.com/lit/articles/insider/v2n1a5.htm - well written but old.
(I'll keep adding to this. Suggestions are welcome.)

6 More CUDA

CUDA function qualifiers:
1. __global__ device function called from host, starting a kernel.
2. __device__ device function called from device function.
3. __host__ (default) host function called from host function.
CUDA variable qualifiers:
1. __shared__
2. __device__ global
3. __device__ __managed__ automatically paged between host and device.
4. __constant__
5. (nothing) register if scalar, or local if array or if no more registers available.
If installing CUDA on your machine, this repository seems best:

http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64

That includes the Thrust headers but not example programs.

7 Unionfs: Linux trick of the day

aka overlay FS, translucent FS.
If a, b are directories, and m is an empty directory, then

unionfs -o cow a=RW:b m

makes m to be a combo of a and b, with a being higher priority
Writing a file into m writes it in a.
Changing a file in b writes the new version into a
Deleting a file in b causes a white-out note to be stored in a.
Unmount it thus:

fusermount -u m
None of this requires superuser.
Application: making a read-only directory into a read-write directory.
Note: IBM had a commercial version of this idea in its CP/CMS OS in the 1960s.

8 OpenCL

Module 20.
Apple's competition to CUDA.
is largely CUDA but they changed the names.
not interesting so long as Nvidia is dominant.

9 Nvidia GPU and accelated computing, ctd.

This material accompanies Programming Massively Parallel Processors A Hands-on Approach, Third Edition, David B. Kirk Wen-mei W. Hwu. I recommend it. (The slides etc are free but the book isn't.)

Finishing /parallel-class/GPU-Teaching-Kit-Accelerated-Computing, Module 21, OpenACC.

10 OpenACC

Adds pragmas to C++ code, like OpenMP.
Unlike OpenMP, targets accelerators, e.g., GPUs.
Tries to be higher level than OpenMP.
E.g., kernel pragma says to try to parallelize everything you can.
Nevertheless, the more specific parallel pragma probably leads to better code.
G++ does OpenACC badly.
Use PGI compiler, which Nvidia bought.
In parallel, I've installed it both directly and in a docker file.
Good books:
1. OpenACC for Programmers: Concepts and Strategies By: Sunita Chandrasekaran, Guido Juckeland kindle $31.19
2. Parallel Programming with OpenACC By: Rob Farber kindle: $37.74
  
  https://github.com/rmfarber/ParallelProgrammingWithOpenACC.

PAR Class 11, Thu 2020-02-20

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-02-19 00:00

Table of contents

1 Term project
2 No class Thu Feb 27
3 Student talks in class, round 2
4 Nvidia conceptual hierarchy
5 GPU range of speeds
6 CUDA
- 6.1 Versions
- 6.2 Misc
7 Nvidia GPU and accelated computing, ctd.

1 Term project

See the syllabus.
Proposal due March
Progress reports on March 26, April 9 (Thu)
Presentations in class April 20 (Mon), 23 (Thu), 27 (Mon).
Paper etc due on Wed April 29 (last class).

2 No class Thu Feb 27

A group of us will be visiting IBM's Quantum Computing group.

On Monday I'll make suggestions for how to occupy your time.

3 Student talks in class, round 2

Choose a topic (or group of topics) from the 589 on-demand sessions at the GPU Tech Conference.
Summarize it in class. OK to show the whole presentation (or parts of it) with your commentary.
Groups of two people are allowed. However, this time, please keep your talks to 6 min.
Presentation dates: March 2 and 5.
No need to sign up for your topic since there'll probably be few overlaps. However here's a signup site for your presentation date.

https://doodle.com/poll/hwqrmeyxgsvk2mzk

I've allowed up to 11 talks per day.

Please enter your name and rcsid.

4 Nvidia conceptual hierarchy

As always, this is as I understand it, and could be wrong. Nvidia uses their own terminology inconsistently. They may use one name for two things (E.g., Tesla and GPU), and may use two names for one thing (e.g., module and accelerator). As time progresses, they change their terminology.

At the bottom is the hardware micro-architecture. This is an API that defines things like the available operations. The last several Nvidia micro-architecture generations are, in order, Tesla (which introduced unified shaders), Fermi, Kepler, Maxwell (introduced in 2014), Pascal (2016), and Volta (2018).
Each micro-architecture is implemented in several different microprocessors. E.g., the Kepler micro-architecture is embodied in the GK107, GK110, etc. Pascal is GP104 etc. The second letter describes the micro-architecture. Different microprocessors with the same micro-architecture may have different amounts of various resources, like the number of processors and clock rate.
To be used, microprocessors are embedded in graphics cards, aka modules or accelerators, which are grouped into series such as GeForce, Quadro, etc. Confusingly, there is a Tesla computing module that may use any of the Tesla, Fermi, or Kepler micro-architectures. Two different modules using the same microprocessor may have different amounts of memory and other resources. These are the components that you buy and insert into a computer. A typical name is GeForce GTX1080.
There are many slightly different accelerators with the same architecture, but different clock speeds and memory, e.g. 1080, 1070, 1060, ...
The same accelerator may be manufactured by different vendors, as well as by Nvidia. These different versions may have slightly different parameters. Nvidia's reference version may be relatively low performance.
The term GPU sometimes refers to the microprocessor and sometimes to the module.
There are at least four families of modules: GeForce for gamers, Quadro for professionals, Tesla for computation, and Tegra for mobility.
Nvidia uses the term Tesla in two unrelated ways. It is an obsolete architecture generation and a module family.
Geoxeon has a (Maxwell) GeForce GTX Titan and a (Kepler) Tesla K20xm. Parallel has a (Volta) RTX 8000 and (Pascal) GeForce GTX 1080. We also have an unused (Kepler) Quadro K5000.
Since the highest-end (Tesla) modules don't have video out, they are also called something like compute modules.

5 GPU range of speeds

Here is an example of the wide range of Nvidia GPU speeds; all times are +-20%.

The Quadro RTX 8000 has 4608 CUDA cores @ 1.77GHz and 48GB of memory. matrixMulCUBLAS runs at 5310 GFlops. The specs claim 16 TFlops. However those numbers understate its capabilities because it also has 576 Tensor cores and 72 ray tracing cores to cast 11G rays/sec.

The GeForce GTX 1080 has 2560 CUDA cores @ 1.73GHz and 8GB of memory. matrixMulCUBLAS runs at 3136 GFlops. However the reported time (0.063 msec) is so small that it may be inaccurate. The quoted speed of the 1080 is about triple that. I'm impressed that the measured performance is so close.

The Quadro K2100M in my Lenovo W540 laptop has 576 CUDA cores @ 0.67 GHz and 2GB of memory. matrixMulCUBLAS runs at 320 GFlops. The time on the GPU was about .7 msec, and on the CPU 600 msec.

It's nice that the performance almost scaled with the number of cores and clock speed.

6 CUDA

6.1 Versions

CUDA has a capability version, whose major number corresponds to the micro-architecture generation. Kepler is 3.x. The K20xm is 3.5. The GTX 1080 is 6.1. The RTX 8000 is 7.5. Here is a table of the properties of different compute capabilities. However, that table is not completely consistent with what deviceQuery shows, e.g., the shared memory size.
nvcc, the CUDA compiler, can be told which capabilities (aka architectures) to compile for. They can be given as a real architecture, e.g., sm_61, or a virtual architecture. e.g., compute_61.
The CUDA driver and runtime also have a software version, defining things like available C++ functions. The latest is 10.1. This is unrelated to the capability version.

6.2 Misc

With CUDA, the dominant problem in program optimization is optimizing the data flow. Getting the data quickly to the cores is harder than processing it. It helps big to have regular arrays, where each core reads or writes a successive entry.

This is analogous to the hardware fact that wires are bigger (hence, more expensive) than gates.
That is the opposite optimization to OpenMP, where having different threads writing to adjacent addresses will cause the false sharing problem.
Nvidia CUDA FAQ
1. has links to other Nvidia docs.
2. can be a little old.

7 Nvidia GPU and accelated computing, ctd.

This material accompanies Programming Massively Parallel Processors A Hands-on Approach, Third Edition, David B. Kirk Wen-mei W. Hwu. I recommend it. (The slides etc are free but the book isn't.)

Continuing /parallel-class/GPU-Teaching-Kit-Accelerated-Computing, starting at Modules 17 slide 14.

PAR Class 10, Tues 2020-02-18

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-02-18 00:00

Table of contents

1 More on ssh

ssh-agent on the source machine starts a daemon to manage your keys. I start it in a login script.
ssh-add on the source machine creates a key pair in ~/.ssh . Do it once.
ssh-add -l lists the keys registered with ssh-agent, i.e., that are available to use with future ssh commands in this session.
~/.ssh/authorized_keys on the destination machine stores the public keys of source machines allowed to connect. Copy your source machines' public keys into it. Do that once.
If you use a different user name on the destination machine, instead of doing user@destination all the time, you can set defaults on the source machine, for the destination user names, in ~/.ssh/config.
ssh -v destination shows the handshaking.
Now ssh, scp, Emacs tramp mode, mounting the remote filesystem, etc. work w/o typing passwords.

2 PGI compilers on parallel.ecse

I've installed pgc++ directly. Run it thus:

/local/pgi/linux86-64/19.10/bin/pgc++ foo.cc -o foo

Here's a set of good switches:

/local/pgi/linux86-64/19.10/bin/pgc++ -fast -mp -Msafeptr -O3 -Minfo=all -Mconcur=allcores  -ta:tesla -acc foo.cc -o foo

In parallel.ecse:/parallel-class/matmul, I experiment with gcc and pgi with openmp running on the Xeon, and openacc running on the GPU, multiplying matrices stored as global data, on the local stack, as STL vectors of vectors, and as a heap array. Some of my conclusions:

For sequential code, g++ is twice as fast as pgic++.
OpenMP works well.
OpenACC works only on pgc++. It is very fast.

3 Nvidia GPU and accelated computing, ctd.

This material accompanies Programming Massively Parallel Processors A Hands-on Approach, Third Edition, David B. Kirk Wen-mei W. Hwu. I recommend it. (The slides etc are free but the book isn't.)

Continuing /parallel-class/GPU-Teaching-Kit-Accelerated-Computing, Modules 13 to 17 slide 14.

4 Notes about parallel.ecse

Since it has 256GB of main memory, there's no paging, and pinning memory is not a problem.

Linux now has Heterogeneous Memory Management (HMM) .

PAR Homework 5, due Thu 2020-02-20, noon

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-02-13 00:00

Rules

Submit the answers to Gradescope.
You may do homeworks in teams of 2 students. Create a gradescope team and make one submission with both names.
For redundancy, at the top of your submitted homework write what it is and who you are. E.g., "Parallel Homework 2, 1/30/20, by Boris Badenov and Natasha Fatale".
Each question is 10 points.

Questions

Assume that a kernel is launched with 1000 thread blocks each of which has 512 threads. If a variable is declared as a shared memory variable, how many versions of the variable will be created through the lifetime of the execution of the kernel? (module 4)
Assume that each atomic operation in a DRAM system has a total latency of 100ns. What is the maximal throughput we can get for atomic operations on the same global memory variable? (module 7)
For a processor that supports atomic operations in L2 cache, assume that each atomic operation takes 4ns to complete in L2 cache and 100ns to complete in DRAM. Assume that 90% of the atomic operations hit in L2 cache. What is the approximate throughput for atomic operations on the same global memory variable?

For the following basic reduction kernel code fragment, if the block size is 1024 and warp size is 32, how many warps in a block will have divergence during the iteration where stride is equal to 1? (module 9):

unsigned int t = threadIdx.x;
Unsigned unsigned int start = 2*blockIdx.x*blockDim.x;
partialSum[t] = input[start + t];
partialSum[blockDim.x+t] = input[start+ blockDim.x+t];
for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2)
{
__syncthreads();
if (t % stride == 0) {partialSum[2*t]+= partialSum[2*t+stride];}
}

In the previous question, how many warps in a block will have divergence during the iteration where stride is equal to 16?
For the work inefficient scan kernel based on reduction trees, assume that we have 1024 elements, which of the following gives the closest approximation of the number of add operations performed? (module 10)
1. (1024-1)*2
2. (512-1)*2
3. 1024*1024
4. 1024*10
Which of the following statements is true? (module 14)
1. Data transfer between CUDA device and host is done by DMA hardware using virtual addresses.
2. The OS always guarantees that any memory being used by DMA hardware is not swapped out.
3. If a pageable data is to be transferred by cudyMemcpy(), it needs to be first copied to a pinned memory buffer before transferred.
4. Pinned memory is allocated with cudaMalloc() function.
What is the CUDA API call that makes sure that all previous kernel executions and memory copies in a device have been completed?
1. __syncthreads()
2. cudaDeviceSynchronize()
3. cudaStreamSynchronize()
4. __barrier()

Total: 90 pts.

PAR Class 9, Thu 2020-02-13

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-02-12 00:00

Table of contents

1 Nvidia GPU and accelated computing, ctd.

1 Nvidia GPU and accelated computing, ctd.

This material accompanies Programming Massively Parallel Processors A Hands-on Approach, Third Edition, David B. Kirk Wen-mei W. Hwu. I recommend it. (The slides etc are free but the book isn't.)

Continuing /parallel-class/GPU-Teaching-Kit-Accelerated-Computing, Modules 9 to 12.

PAR Class 8, Mon 2020-02-10

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-02-10 00:00

Table of contents

1 Docker

Docker is worth learning, apart from its use by Nvidia for parallel computing. You might also look up Kubernetes.

On parallel, I think that docker is probably secure, but am not certain. In return for my allowing access to docker on parallel, I expect you to report, and not to exploit, holes that you might discover.

Compiling:

docker run --gpus all -v $PWD:/tmp -it nvcr.io/hpc/pgi-compilers:ce
Inside docker: pcg++ -mp -Minfo=all foo.cc

2 Nvidia GPU and accelated computing, ctd.

This material accompanies Programming Massively Parallel Processors A Hands-on Approach, Third Edition, David B. Kirk Wen-mei W. Hwu. I recommend it. (The slides etc are free but the book isn't.)

Continuing /parallel-class/GPU-Teaching-Kit-Accelerated-Computing, Modules 4 to 8.

PAR Homework 4, due Thu 2020-02-13, noon

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-02-06 00:00

Rules

Submit the answers to Gradescope.
You may do homeworks in teams of 2 students. Create a gradescope team and make one submission with both names.
For redundancy, at the top of your submitted homework write what it is and who you are. E.g., "Parallel Homework 2, 1/30/20, by Boris Badenov and Natasha Fatale".

Questions

These may require some research.

(10 points) Compare and contrast these different types of memory on the Nvidia GPU. How big are they? How fast are they? How global is their visibility?
1. register
2. shared
3. local
4. global
(10 points) Compare and contrast on size, synchronization of components, order in hierarchy.
1. thread block
2. thread warp
3. grid
(10 points) In what way is the simple CUDA example in Module 2 obsolete?
(10 points) On parallel, according to /local/cuda/samples/0_Simple/matrixMulCUBLAS how many FLOPS is the RTX 8000 on parallel? How does that compare to your program (that didn't use the GPU)?

Total: 40 pts.

PAR Class 7, Thu 2020-02-06

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-02-05 00:00

Table of contents

1 Docker on parallel

I've installed docker, a popular lightweight virtualization system, on parallel, because Nvidia uses it to distribute SW.
Docker runs images that define virtual machines.
Docker images share resources with the host, in a controlled manner.
You can install private copies of images, or to see what images I've installed, do: docker images .
Run the hello-world image thus: docker run hello-world
Here's a more complicated example:

docker run -it --mount type=bind,source=/parallel-class,destination=/parallel-class --mount type=bind,source=$HOME,destination=/home --gpus=all nvidia/cuda:10.1-devel

This interactively runs a virtual machine with
1. Nvidia's CUDA development tools
2. access to parallel's GPUs
3. access to parallel's /parallel-class, mounted locally at /parallel-class .
4. access to your home dir, mounted at /home.
E.g., go into /parallel-class/openmp/rpi and run some programs.
Copy some .cc files to your home dir, compile, and run them.
There are ways to make the image's contents persistent. E.g., you can customize and save an image.
For simple images, which is not nvidia/cuda, starting the image is so cheap that you can do it to run one command, and encapsulate the whole process in a shell function. More later. However, e.g.,

docker run --gpus=all nvidia/cuda:10.1-devel nvidia-smi
parallel has CUDA sample programs in /local/cuda/samples. To make them available in a docker image, include -v local/cuda/samples:/samples

2 Nvidia GPU and accelated computing, ctd.

Continuing /parallel-class/GPU-Teaching-Kit-Accelerated-Computing at Module 3.

PAR Class 6, Mon 2020-02-03

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-02-02 00:00

Table of contents

1 Student talks (round 1) finished

Note: in a month, we'll have another round.

2 Nvidia GPU and accelated computing

This is from https://developer.nvidia.com/teaching-kits-downloads

My local copy of what I'm using is in /parallel-class/GPU-Teaching-Kit-Accelerated-Computing

Today we did through module 2.

3 Changes this year

For students who've looked at last year's course, here are some upcoming changes.

I'm dropping Stanford's parallel course. It's very nice but just too old.
I'm adding material from Nvidia's developer site.
I'll be presenting OpenACC.
I'll think I'll recommend the PG compiler instead of g++.
That requires recommending docker.

PAR Class 5, Thu 2020-01-30

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-01-30 00:00

Table of contents

1 Homework 3

is online here , due in a week.

2 Student talks ctd

Thu
1. Alexa, Kevin
2. Camer
3. Chris
4. Clara
5. David
6. Emily, Savan
7. Eliza
8. Garet
9. Matth, Skyla
10. MikeM
11. Zhepe
OpenMP ctd, including points listed in earlier classes but not covered yet.

PAR Homework 3, due Thu 2020-02-06, noon

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-01-30 00:00

Rules

Submit the answers to Gradescope.
You may do homeworks in teams of 2 students. Create a gradescope team and make one submission with both names.
For redundancy, at the top of your submitted homework write what it is and who you are. E.g., "Parallel Homework 2, 1/30/20, by Boris Badenov and Natasha Fatale".

Question

The goal is to measure whether OpenMP actually makes matrix multiplication faster, with and w/o SIMD.
You may use anything in /parallel-class/openmp that seems useful.
Write a C++ program on parallel.ecse to initialize pseudorandomly and multiply two 100x100 float matrices. One possible initialization:

a[i][j] = i*3 + (j*j)%5; b[i][j] = i*2 + (j*j)%7;
(10 points) Report the elapsed time. Include the program listing.
Add an OpenMP pragma to do the work in parallel.
(10 points) Report the elapsed time, varying the number of threads thus: 1, 2, 4, 8, 16, 32, 64.

What do you conclude?
(5 points) Repeat that two more times to see how consistent the times are.
(10 points) Modify the pragma to use SIMD.

Report the elapsed time, varying the number of threads thus: 1, 2, 4, 8, 16, 32, 64.

What do you conclude?
(10 points) Compile and run your program with two different levels of compiler optimization: O1 and O3, reporting the elapsed time. Modify your program to prevent the optimizer from optimizing the program away to nothing. E.g., print a few values.
(5 points) What do you conclude about everything?

Total: 40 pts.

PAR Class 4, Mon 2020-01-27

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-01-26 00:00

Table of contents

1 Student talks today

Mon
1. Hayle
2. JohnF
3. Misha
4. RossD

2 OpenMP

continuing for several classes.

PAR Syllabus

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-01-23 00:00

Table of contents

1 Catalog info

Title:	ECSE-4740-01 Applied Parallel Computing for Engineers, CRN 94829
Semesters:	Spring term annually
Credits:	3 credit hours
Time and place:	Mon and Thurs noon-1:20pm, JONSSN 4107

2 Description

This is intended to be a computer engineering course to provide students with knowledge and hands-on experience in developing applications software for affordable parallel processors. This course will mostly cover hardware that any lab can afford to purchase. It will cover the software that, in the prof's opinion, is the most useful. There will also be some theory.
The target audiences are ECSE seniors and grads and others with comparable background who wish to develop parallel software.
This course will have minimal overlap with parallel courses in Computer Science. We will not teach the IBM BlueGene, because it is so expensive, nor cloud computing and MPI, because most big data problems are in fact small enough to fit on our hardware.
You may usefully take all the parallel courses at RPI.
This unique features of this course are as follows:
1. For 2/3 of the course, use of only affordable hardware that any lab might purchase, such as Nvidia GPUs. This is currently the most widely used and least expensive parallel platform.
2. Emphasis on learning several programming packages, at the expense of theory. However you will learn a lot about parallel architecture.

Hardware taught, with reasons:

Multicore Intel Xeon:
	universally available and inexpensive, comparatively easy to program, powerful
Nvidia GPU accelerator:
	widely available (Nvidia external graphics processors are on 1/3 of all PCs), very inexpensive, powerful, but harder to program. Good cards cost only a few hundred dollars.
IBM quantum computer:
	It's new and hot and might some day be useful.

Software that might be taught, with reasons:

OpenMP C++ extension:
	widely used, easy to use if your algorithm is parallelizable, backend is primarily multicore Xeon but also GPUs.
Thrust C++ functional programming library:
	FP is nice, hides low level details, backend can be any major parallel platform.
MATLAB, Mathematica:
	easy to use parallelism for operations that they have implemented in parallel, etc.
CUDA C++ extension and library for Nvidia:
	low level access to Nvidia GPUs.

The techniques learned here will also be applicable to larger parallel machines -- numbers 1 and 2 on the top 500 list use NVIDIA GPUs. (Number 12 is a BlueGene.)
Effectively programming these processors will require in-depth knowledge about parallel programming principles, as well as the parallelism models, communication models, and resource limitations of these processors.

3 Prerequisite

ECSE-2660 CANOS or equivalent, knowledge of C++.

4 Instructors

4.1 Professor

W. Randolph Franklin. BSc (Toronto), AM, PhD (Harvard)

Informal meetings:
Office:	Jonsson Engineering Center (JEC) 6026
Phone:	+1 (518) 276-6077 (forwards)
Email:	frankwr@YOUKNOWTHEDOMAIN Email is my preferred communication medium. Sending from non-RPI accounts are fine, but please show your name, at least in the comment field. A subject prefix of #Prob is helpful. GPG encryption is fine.
Web:	https://wrf.ecse.rpi.edu/ A quick way to get there is to google RPIWRF.
Office hours:	After each lecture, usually as long as anyone wants to talk. Also by appointment.
	If you would like to lunch with me, either individually or in a group, just mention it. We can then talk about most anything legal and ethical.

4.2 Teaching assistant

Elkin Cruz, cruzce@THEUSUALDOMAIN

Office hour: Tuesday 6-8pm in the flip flop Lounge.

5 Course websites

The homepage has lecture summaries, syllabus, homeworks, etc.

6 Reading material

6.1 Text

There is no required text, but the following inexpensive books may be used. I might mention others later.

Sanders and Kandrot, CUDA by example. It gets excellent reviews, although it is several years old. Amazon has many options, including Kindle and renting hardcopies.
Kirk and Hwu, 2nd edition, Programming massively parallel processors. It concentrates on CUDA.

One problem is that even recent books may be obsolete. For instance they may ignore the recent CUDA unified memory model, which simplifies CUDA programming at a performance cost. Even if the current edition of a book was published after unified memory was released, the author might not have updated the examples.

6.2 Web

There is a lot of free material on the web, which I'll reference class by class. Because web pages are vanish so often (really!), I may cache some locally. If interested, you might start here:

https://hpc.llnl.gov/training/tutorials

7 Computer systems used

7.1 parallel.ecse

This course will primarily use (remotely via ssh) parallel.ecse.rpi.edu.

Parallel has:

dual 14-core Intel Xeon E5-2660 2.0GHz
256GB of DDR4-2133 ECC Reg memory
Nvidia GPUs: #. Quadro RTX 8000 with 48GB memory, 16 TFLOPS, and 4608 CUDA cores. This can do real time ray tracing. #. GeForce GTX 1080 with 8GB memory and 2569 CUDA cores.
Samsung Pro 850 1TB SSD
WD Red 6TB 6GB/s hard drive
CUDA 10.1
OpenMP 4.5
Thrust
Ubuntu 19.10

Material for the class is stored in /parallel-class/ .

7.2 Amazon EC2

We might also use a parallel virtual machine on the Amazon EC2. If so, you will be expected to establish an account. I expect the usage to be in the free category.

7.3 Piazza

Piazza will be available for discussion and questions.

7.4 Gradescope

Gradescope will be used for you to submit homeworks and for us to distribute grades.

The entry code for this course is 9DYRB2.

Please add yourself.

8 Assessment measures, i.e., grades

There will be no exams.
Each student will make 2 or 3 in-class presentations summarizing some relevant topic.
There will be a term project.

8.1 Homeworks

There will be some homeworks.

You may do homeworks in teams of 2 students. Create a gradescope team and make one submission with both names.

8.2 Term project

For the latter part of the course, most of your homework time will be spent on a term project.
You are encouraged do it in teams of up to 3 people. A team of 3 people would be expected to do twice as much work as 1 person.
You may combine this with work for another course, provided that both courses know about this and agree. I always agree.
If you are a grad student, you may combine this with your research, if your prof agrees, and you tell me.
You may build on existing work, either your own or others'. You have to say what's new, and have the right to use the other work. E.g., using any GPLed code or any code on my website is automatically allowable (because of my Creative Commons licence).
You will implement, demonstrate, and document something vaguely related to parallel computing.
Deliverables:
1. An implementation showing parallel computing.
2. An extended abstract or paper on your project, written up like a paper. You should follow the style guide for some major conference (I don't care which, but can point you to one).
3. A more detailed manual, showing how to use it.
4. A 2-minute project proposal given to the class around the middle of the semester.
5. A 10-minute project presentation and demo given to the class in the last week.
6. Some progress reports.
7. A write-up uploaded on the last class day. This will contain an academic paper, code and perhaps video or user manual.
Size

It's impossible to specify how many lines of code makes a good term project. E.g., I take pride in writing code that is can be simultaneously shorter, more robust, and faster than some others. See my 8-line program for testing whether a point is in a polygon: Pnpoly.

According to Big Blues, when Bill Gates was collaborating with around 1980, he once rewrote a code fragment to be shorter. However, according to the IBM metric, number of lines of code produced, he had just caused that unit to officially do negative work.

9 Early warning system (EWS)

As required by the Provost, we may post notes about you to EWS, for example, if you're having trouble doing homeworks on time, or miss an exam. E.g., if you tell me that you had to miss a class because of family problems, then I may forward that information to the Dean of Students office.

10 Academic integrity

See the Student Handbook for the general policy. The summary is that students and faculty have to trust each other. After you graduate, your most important possession will be your reputation.

Specifics for this course are as follows.

You may collaborate on homeworks, but each team of 1 or 2 people must write up the solution separately (one writeup per team) using their own words. We willingly give hints to anyone who asks.
The penalty for two teams handing in identical work is a zero for both.
You may collaborate in teams of up to 3 people for the term project.
You may get help from anyone for the term project. You may build on a previous project, either your own or someone else's. However you must describe and acknowledge any other work you use, and have the other person's permission, which may be implicit. E.g., my web site gives a blanket permission to use it for nonprofit research or teaching. You must add something creative to the previous work. You must write up the project on your own.
However, writing assistance from the Writing Center and similar sources in allowed, if you acknowledge it.
The penalty for plagiarism is a zero grade.
Cheating will be reported to the Dean of Students Office.

11 Student feedback

Since it's my desire to give you the best possible course in a topic I enjoy teaching, I welcome feedback during (and after) the semester. You may tell me or write me, or contact a third party, such as Prof James Lu, the ECSE undergrad head, or Prof John Wen, the ECSE Dept head.

PAR Class 3, Thu 2020-01-23

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-01-22 00:00

Table of contents

1 Homework 2

is online, due in a week.

2 Student talks next week

This is homework 1.

Having talks today is too rushed, so let's have all the talks next week (unless anyone wants to talk today).

As of 1/21, 10 people have picked a topic. I assigned them all to talk on Thur 1/30. The other people have been assigned to talk on Mon 1/27.

Grad students who haven't registered for a reading class and who also haven't doodled are also not listed. Add yourselves to Mon.

For privacy purposes, I've abbreved the names.

Monday speakers: please tell me your topics.

If anyone is part of a group, please tell me.

If any Thursday speaker would like to speak Mon, please tell me. You'll get first choice for dates on the next presentation.

Mon
1. Hayle
2. JohnF
3. Misha
4. RossD
Thu
1. Alexa, Kevin
2. Camer
3. Chris
4. Clara
5. David
6. Emily, Savan
7. Eliza
8. Garet
9. Matth, Skyla
10. MikeM
11. Zhepe

3 parallel.ecse

The machine is parallel.ecse.rpi.edu . You must be on campus or use the VPN to connect. Connect with ssh.

I've created accounts for everyone in ECSE-4740 but not assigned passwords (and so you cannot login). In class today I'll let you type in a password. I'll create accounts for the reading students.

Its accounts are completely separate from RCS; I used the same userid for convenience. Changing your password on parallel does not affect your RCS password, and vv.

4 ssh, afs, zfs

I recommend that you create key pairs to connect to parallel w/o typing your password each time.
To avoid retyping your ssh private key passphrase in the same shell, do this

ssh-add
One advantage of ssh is that you can mount your parallel directory on your local machine. In the files browser on my laptop, I connect to the server sftp://parallel.ecse.rpi.edu . Any other program to mount network shares would work as well.

Where the share is actually mounted varies. On my laptop, it's here: /var/run/user/1000/gvfs/sftp:host=parallel.ecse.rpi.edu,user=wrf

1000 is my uid on my laptop. Yours is not 1000.

As with any network mount, fancy filesystem things like attributes, simultaneous reading and writing from different machines, etc., probably won't work.
With ssh, you can also copy files back and forth w/o mounting or typing passwords:
1. scp localfile parallel.ecse.rpi.edu:
2. scp -r localdir parallel.ecse.rpi.edu:
3. scp parallel.ecse.rpi.edu:remotefile .
It even does filename completion on remote files.
You can also run single commands:

ssh parallel.ecse.rpi.edu hostname
In emacs, use tramp mode: Access file as /HOST:/FILE
On parallel, most filesystems use zfs.
1. Files are transparently compressed, so there's no need to explicitly use gzip etc.
2. Filesystems are automatically snapshotted, so you can often recover accidentally deleted or changed files. Look under FSROOT /.zfs/snapshot/

5 OpenMP

Versions: OpenMP is a living project, that is being updated every few years.
1. There is a lot of free online documentation, examples, and tutorials.
2. New features are regularly added, such as support for GPU backends.
3. However the documentation and implementations lags.
4. In my course, I'll use obsolete documentation when it's good enough. It's all still correct and fine for most applications.
5. Then I'll present recent some additions.
We'll review some examples running on parallel, at /parallel-class/openmp/rpi/ .
Now that we've seen some examples, look at the LLNL tutorial. https://computing.llnl.gov/tutorials/openMP/
We'll see some more examples running on parallel, at /parallel-class/openmp/rpi
We saw that running even a short OpenMP program could be unpredictable. The reason for that was a big lesson.

There are no guarantees about the scheduling of the various threads. There are no guarantees about fairness of resource allocation between the threads. Perhaps thread 0 finishes before thread 1 starts. Perhaps thread 1 finishes before thread 2 starts. Perhaps thread 0 executes 3 machine instructions, then thread 1 executes 1 instruction, then thread 0 executes 100 more, etc. Each time the process runs, something different may happen. One thing may happen almost all the time, to trick you.

This happened during the first space shuttle launch in 1981. A 1-in-70 chance event prevented the computers from synchonizing. This had been observed once before during testing. However when they couldn't get it to happen again, they ignored it.

This interlacing of different threads can happen at the machine instruction level. The C statement K=K+3 can translate into the machine instructions

LOAD K; ADD 3; STORE K.

Let's color the threads red and blue.

If the threads interlace thus:

LOAD K; ADD 3; STORE K; LOAD K; ADD 3; STORE K

then K increases by 6. If they interlace thus:

LOAD K; ADD 3; LOAD K; ADD 3; STORE K; STORE K;

then K increases by 3.
This can be really hard to debug, particularly when you don't know where this is happening.

One solution is to serialize the offending code, so that only one thread executes it at a time. The limit would be to serialize the whole program and not to use parallel processing.
OpenMP has several mechanisms to help.
A critical pragma serializes the following block. There are two considerations.
1. You lose the benefits of parallelization on that block.
2. There is an overhead to set up a critical block that I estimate might be 100,000 instructions.
An atomic pragma serializes one short C statement, which must be one of a small set of allowed operations, such as increment. It matches certain atomic machine instructions, like test-and-set. The overhead is much smaller.
If every thread is summing into a common total variable, the reduction pragma causes each thread to sum into a private subtotal, and then sum the subtotals. This is very fast.
Another lesson is that sometimes you can check your program's correctness with an independent computation. For the trivial example of summing $i^2$, use the formula

$\sum_{i=1}^N i^2 = \frac{N(N+1)(2N+1)}{6}$.

There is a lot of useful algebra if you have the time to learn it. I, ahem, learned this formula in high school.
Another lesson is that even when the program gives the same answer every time, it might still be consistently wrong.
Another is that just including OpenMP facilities, like -fopenmp, into your program slows it down even if you don't use them.
Another is that the only meaningful time metric is elapsed real time. One reason that CPU time is meaningless is the OpenMP sometimes pauses a thread's progress by making it do a CPU-bound loop. (That is also a common technique in HW.)
Also note that CPU times can vary considerably with successive executions.
Also, using too many threads might increase the real time.
Finally, floating point numbers have their own problems. They are an approximation of the mathematical concept called the real number field. That is defined by 11 axioms that state obvious properties, like

A+B=B+A (commutativity) and

A+(B+C)=(A+B+C) (associativity).

This is covered in courses on modern algebra or abstract algebra.

The problem is that most of these are violated, at least somewhat, with floats. E.g.,

$\left(10^{20}-10^{20}\right)+1=0+1=1$ but

$10^{20}+\left(-10^{20}+1\right)=10^{20}-10^{20}=0$.

Therefore when threads execute in a different order, floating results might be different.

There is no perfect solution, though using double is a start.
1. On modern CPUs, double is just as fast as float.
2. However it's slower on GPUs.
  
  How much slower can vary. Nvidia's Maxwell line spent very little real estate on double precision, so it was very slow. In contrast, Maxwell added a half-precision float, for implementing neural nets. Apparently neural nets use very few significant digits.
  
  Nvidia's newer Pascal line reverted this design choice and spends more real estate on implementing double. parallel.ecse's GPU is Pascal.
  
  Nvidia's Volta generation also does doubles well;
3. It also requires moving twice the data, and data movement is often more important than CPU time.
The large field of numerical analysis is devoted to finding solutions, with more robust and stable algorithms.

Summing an array by first sorting, and then summing the absolutely smaller elements first is one technique. Inverting a matrix by pivoting on the absolutely largest element, instead of on $a_{11}$ is another.
Another nice, albeit obsolescent and incomplete, ref: Wikibooks OpenMP.
Its tasks examples are interesting.
OpenMP tasks
1. While inside a pragma parallel, you queue up lots of tasks - they're executed as threads are available.
2. My example is tasks.cc with some variants.
3. taskfib.cc is modified from an example in the OpenMP spec.
  1. It shows how tasks can recursively spawn more tasks.
  2. This is only an example; you would never implement fibonacci that way. (The fastest way is the following closed form. In spite of the sqrts, the expression always gives an integer. )
    
    $F_n = \frac{\left( \frac{1+\sqrt{5}}{2} \right) ^n - \left( \frac{1-\sqrt{5}}{2} \right) ^n }{\sqrt{5}}$
  3. Spawning a task is expensive; we can calculate the cost.
stacksize.cc - get and set stack size. Can also do ulimit -s.
omps.cc - read assorted OMP variables.
Since I/O is not thread-safe, in hello_single_barrier.cc it has to be protected by a critical and also followed by a barrier.
OpenMP sections - my example is sections.cc with some variants.
1. Note my cool way to print an expression's name followed by its value.
2. Note the 3 required levels of pragmas: parallel, sections, section.
3. The assignment of sections to threads is nondeterministic.
4. IMNSHO, OpenMP considerably easier than pthreads, fork, etc.
http://stackoverflow.com/questions/13788638/difference-between-section-and-task-openmp

PAR Homework 2, Thu 2020-01-23

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-01-21 00:00

Rules

Due next Thurs, 2020-01-30, noon.
Submit the answers to Gradescope.
You may do homeworks in teams of 2 students. Create a gradescope team and make one submission with both names.
For redundancy, at the top of your submitted homework write what it is and who you are. E.g., "Parallel Homework 2, 1/30/20, by Boris Badenov and Natasha Fatale".

Questions

Each is 5 points.

Why is one CUDA core less powerful than one Intel Xeon core?
If 10% of a program cannot be parallelized, then what is the max speedup even with infinite parallelism?
"For short running parallel programs, there can actually be a decrease in performance compared to a similar serial implementation.". Why?
What is the difference between strong scaling and weak scaling?
What is cache coherency?
Define "embarrassingly parallel".
How many CUDA cores does the RTX 8000 have?
Why have machine cycle speeds stopped increasing?
Give an advantage and a disadvantage of shared memory versus distributed memory for parallel computation.
In OpenMP, what's the difference between ATOMIC and CRITICAL?

Total: 50 pts.

PAR Class 2, Thu 2020-01-16

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-01-16 00:00

Table of contents

1 LLNL parallel tutorial

We'll quickly finish it. https://computing.llnl.gov/tutorials/parallel_comp/

2 Student talks - 1st set

Go to

https://doodle.com/poll/68axuciyyp3avwz5. Pick a topic by entering your name and rcsid (because sometimes students have similar names and sometimes students use different names from what's on the classlist). The topics are:

Describe the fastest 2 computers on the top500 list. https://www.top500.org/lists/2018/11/

Describe computers 3 and 4 on top500.

Describe computers 5 and 6 on top500.

Describe the Blue Gene.

Summarize python's parallel capabilities.

Summarize Matlab's parallel capabilities.

Summarize Mathematica's parallel capabilities.

Summarize MPI.

Describe an application where physics needs/uses parallel computing.

Describe an application where astronomy needs/uses parallel computing.

Describe an application where biology needs/uses parallel computing.

Describe an application where astronomy needs/uses parallel computing.

Give an application where machine learning needs/uses parallel computing.

Give an application where computational fluid dynamics needs/uses parallel computing.

Summarize OpenACC.

Summarize pthreads.

More topics TBA.

Other related topic of your choice. Tell me and I'll add it to the list.

Note that any topic can be picked at most once.
Two students may work together to present one talk.
Prepare and give 5-10 minute talk next Thurs 1/23 or Mon 1/27 or Thurs 1/30. We'll spread the talks out evenly. A team would jointly give one 5-10 minute talk.

3 parallel.ecse

Its accounts are completely separate from RCS; I will use the same userid for convenience. Changing your password on parallel does not affect your RCS password, and vv.

4 OpenMP

We'll see some examples running on parallel, at /parallel-class/openmp/rpi
We saw that running even a short OpenMP program could be unpredictable. The reason for that was a big lesson.

There are no guarantees about the scheduling of the various threads. There are no guarantees about fairness of resource allocation between the threads. Perhaps thread 0 finishes before thread 1 starts. Perhaps thread 1 finishes before thread 2 starts. Perhaps thread 0 executes 3 machine instructions, then thread 1 executes 1 instruction, then thread 0 executes 100 more, etc. Each time the process runs, something different may happen. One thing may happen almost all the time, to trick you.

PAR Homework 1, Thu 2020-01-16

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-01-16 00:00

(10 points) This is the in-class presentation on Jan 27 and 30 mentioned in class.

PAR Class 1, Mon 2020-01-13

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-01-13 00:00

Table of contents

1 Material

Read the syllabus.
Read https://computing.llnl.gov/tutorials/parallel_comp/ for an intro to parallel computing.
Some points:
1. Parallel computing is decades old; there were commercial machines in the 1980s. I directed two PhD theses in parallel geometry then. However, then, clocks speeds were increasing so serial machines were more interesting.
2. Now: physics limits to processor speed.
3. History of Nvidia.
  1. Curtis Priem had designed graphics HW for both IBM and Sun Microsystems. (For awhile Sun was THE Unix workstation company. They used open standards and had the best price / performance.)
  2. Nvidia designed gaming graphics accelerators...
  3. that just happened to be parallel coprocessors...
  4. that started to be used for nongraphics parallel processing because of their value.
  5. Nvidia noticed that and added more capability, e.g., double precision IEEE floats, to serve that market.
  6. Currently, some of the highest performance Nvidia boards cannot even do graphics because they don't have video out ports.
4. Intel CPUs vs Nvidia CUDA cores.
5. Advantages and disadvantages of shared memory.
6. OpenMP vs CUDA.
7. Rate-limiting cost usually I/O not computation.
Think about these questions.
1. Why have machine cycle speeds stopped increasing?
2. What architectures do the top 3 machines on the Top 500 list use?
3. Which one of the following 4 choices are most GPUs: SISD, SIMD, MISD, MIMD.
4. Which one of the following 4 choices are most current multicore CPUs: SISD, SIMD, MISD, MIMD.
5. Per Amdahl's law, if a program is 10% sequential and 90% parallelizable, what is the max speed up that can be obtained with an infinite number of parallel processors?

2 Recently obsoleted parallel tech

Intel Xeon Phi
IBM BlueGene

3 Computer

parallel.ecse accounts
1. I'll set yours soon.
2. Play with them. I recomend connecting with ssh.

4 My research

I do parallel geometry algorithms on large problems for CAD and GIS. See my home page. If this is interesting, talk to me. Maybe you can do something that leads to a jointly-authored paper.

PAR Grad Syllabus

W Randolph Franklin (WRF), Rensselaer Polytechnic Institute (RPI)

2020-01-13 00:00

Table of contents

1 Catalog info

Title:	ECSE-6xxx Independent study: Parallel, GPU, quantum comput.
Semesters:	Spring, on demand
Credits:	3 credit hours
Time and place:	Mon and Thurs noon-1:20pm, JONSSN 4107

2 Description

This is graduate individual study course.
This is intended to be a computer engineering course to provide students with knowledge and hands-on experience in developing applications software for affordable parallel processors. This course will mostly cover hardware that any lab can afford to purchase. It will cover the software that, in the prof's opinion, is the most useful. There will also be considerable theory.
The target audience is ECSE grads and others with comparable background who wish to learn the theory and then to develop parallel software.
This course will have minimal overlap with parallel courses in Computer Science. We will not teach the IBM BlueGene, because it is so expensive, nor cloud computing and MPI, because most big data problems are in fact small enough to fit on our hardware.
You may usefully take all the parallel courses at RPI.
This unique features of this course are as follows:
1. For 2/3 of the course, use of only affordable hardware that any lab might purchase, such as Nvidia GPUs. This is currently the most widely used and least expensive parallel platform.
2. Emphasis on learning several programming packages, at the expense of theory. However you will learn a lot about parallel architecture.

Hardware taught, with reasons:

Multicore Intel Xeon:
	universally available and inexpensive, comparatively easy to program, powerful
Nvidia GPU accelerator:
	widely available (Nvidia external graphics processors are on 1/3 of all PCs), very inexpensive, powerful, but harder to program. Good cards cost only a few hundred dollars.
IBM quantum computer:
	It's new and hot and might some day be useful.

Software that might be taught, with reasons:

OpenMP C++ extension:
	widely used, easy to use if your algorithm is parallelizable, backend is primarily multicore Xeon but also GPUs.
Thrust C++ functional programming library:
	FP is nice, hides low level details, backend can be any major parallel platform.
MATLAB, Mathematica:
	easy to use parallelism for operations that they have implemented in parallel, etc.
CUDA C++ extension and library for Nvidia:
	low level access to Nvidia GPUs.

The techniques learned here will also be applicable to larger parallel machines -- numbers 1 and 2 on the top 500 list use NVIDIA GPUs. (Number 12 is a BlueGene.)
Effectively programming these processors will require in-depth knowledge about parallel programming principles, as well as the parallelism models, communication models, and resource limitations of these processors.

3 Prerequisite

ECSE-2660 CANOS or equivalent, knowledge of C++.

4 Instructors

4.1 Professor

W. Randolph Franklin. BSc (Toronto), AM, PhD (Harvard)

Informal meetings:
Office:	Jonsson Engineering Center (JEC) 6026
Phone:	+1 (518) 276-6077 (forwards)
Email:	frankwr@YOUKNOWTHEDOMAIN Email is my preferred communication medium. Sending from non-RPI accounts are fine, but please show your name, at least in the comment field. A subject prefix of #Prob is helpful. GPG encryption is fine.
Web:	https://wrf.ecse.rpi.edu/ A quick way to get there is to google RPIWRF.
Office hours:	After each lecture, usually as long as anyone wants to talk. Also by appointment.
	If you would like to lunch with me, either individually or in a group, just mention it. We can then talk about most anything legal and ethical.

4.2 Teaching assistant

Elkin Cruz, cruzce@THEUSUALDOMAIN

Office hour: TBA

5 Course websites

The homepage has lecture summaries, syllabus, homeworks, etc.

6 Reading material

6.1 Text

There is no required text, but the following inexpensive books may be used. I might mention others later.

Sanders and Kandrot, CUDA by example. It gets excellent reviews, although it is several years old. Amazon has many options, including Kindle and renting hardcopies.
Kirk and Hwu, 2nd edition, Programming massively parallel processors. It concentrates on CUDA.

One problem is that even recent books may be obsolete. For instance they may ignore the recent CUDA unified memory model, which simplifies CUDA programming at a performance cost. Even if the current edition of a book was published after unified memory was released, the author might not have updated the examples.

6.2 Web

There is a lot of free material on the web, which I'll reference class by class. Because web pages are vanish so often (really!), I may cache some locally. If interested, you might start here:

https://hpc.llnl.gov/training/tutorials

7 Computer systems used

7.1 parallel.ecse

This course will primarily use (remotely via ssh) parallel.ecse.rpi.edu.

Parallel has:

dual 14-core Intel Xeon E5-2660 2.0GHz
256GB of DDR4-2133 ECC Reg memory
Nvidia GPUs: #. Quadro RTX 8000 with 48GB memory, 16 TFLOPS, and 4608 CUDA cores. This can do real time ray tracing. #. GeForce GTX 1080 with 8GB memory and 2569 CUDA cores.
Samsung Pro 850 1TB SSD
WD Red 6TB 6GB/s hard drive
CUDA 10.1
OpenMP 4.5
Thrust
Ubuntu 19.10

Material for the class is stored in /parallel-class/ .

7.2 Amazon EC2

We might also use a parallel virtual machine on the Amazon EC2. If so, you will be expected to establish an account. I expect the usage to be in the free category.

7.3 Piazza

Piazza will be available for discussion and questions.

7.4 Gradescope

Gradescope will be used for you to submit homeworks and for us to distribute grades.

The entry code for this course is 9DYRB2.

Please add yourself.

8 Assessment measures, i.e., grades

There will be no exams.
There might be some homeworks.
Each student will make 2 or 3 in-class presentations summarizing some relevant topic.
There will be a term project that will include a major research paper.
A blog will be required.

8.1 Research blog

Students will be required to perform individual research into parallel computing principles and to record their findings as weekly entries in a personal blog, a sort of online lab book.

Possible platforms include:

Mathmatica notebook
Jupyter book
Google doc
icloud doc

8.2 Term project

For the latter part of the course, most of your homework time will be spent on a term project.
You are encouraged do it in teams of up to 3 people. A team of 3 people would be expected to do twice as much work as 1 person.
You may combine this with work for another course, provided that both courses know about this and agree. I always agree.
If you are a grad student, you may combine this with your research, if your prof agrees, and you tell me.
You may build on existing work, either your own or others'. You have to say what's new, and have the right to use the other work. E.g., using any GPLed code or any code on my website is automatically allowable (because of my Creative Commons licence).
You will implement, demonstrate, and document something vaguely related to parallel computing.
Deliverables:
1. An implementation showing parallel computing.
2. A major research paper on your project. You should follow the style guide for some major conference (I don't care which, but can point you to one).
3. A more detailed manual, showing how to use it.
4. A 2-minute project proposal given to the class around the middle of the semester.
5. A 10-minute project presentation and demo given to the class in the last week.
6. Some progress reports.
7. A write-up uploaded on the last class day. This will contain an academic paper, code and perhaps video or user manual.
Size

It's impossible to specify how many lines of code makes a good term project. E.g., I take pride in writing code that is can be simultaneously shorter, more robust, and faster than some others. See my 8-line program for testing whether a point is in a polygon: Pnpoly.

According to Big Blues, when Bill Gates was collaborating with around 1980, he once rewrote a code fragment to be shorter. However, according to the IBM metric, number of lines of code produced, he had just caused that unit to officially do negative work.

9 Early warning system (EWS)

As required by the Provost, we may post notes about you to EWS, for example, if you're having trouble doing homeworks on time, or miss an exam. E.g., if you tell me that you had to miss a class because of family problems, then I may forward that information to the Dean of Students office.

10 Academic integrity

See the Student Handbook for the general policy. The summary is that students and faculty have to trust each other. After you graduate, your most important possession will be your reputation.

Specifics for this course are as follows.

You may collaborate on homeworks, but each team of 1 or 2 people must write up the solution separately (one writeup per team) using their own words. We willingly give hints to anyone who asks.
The penalty for two teams handing in identical work is a zero for both.
You may collaborate in teams of up to 3 people for the term project.
You may get help from anyone for the term project. You may build on a previous project, either your own or someone else's. However you must describe and acknowledge any other work you use, and have the other person's permission, which may be implicit. E.g., my web site gives a blanket permission to use it for nonprofit research or teaching. You must add something creative to the previous work. You must write up the project on your own.
However, writing assistance from the Writing Center and similar sources in allowed, if you acknowledge it.
The penalty for plagiarism is a zero grade.
Cheating will be reported to the Dean of Students Office.

11 Student feedback

Since it's my desire to give you the best possible course in a topic I enjoy teaching, I welcome feedback during (and after) the semester. You may tell me or write me, or contact a third party, such as Prof James Lu, the ECSE undergrad head, or Prof John Wen, the ECSE Dept head.