# PAR Lecture 27, Thurs Apr 27

## 1   Course recap

1. My teaching style is to work from particulars to the general.
2. You've seen OpenMP, a tool for shared memory parallelism.
3. You've seen the architecture of NVidia's GPU, a widely used parallel system, and CUDA, a tool for programming it.
4. You've seen Thrust, a tool on top of CUDA, built in the C++ STL style.
5. You've seen how widely used numerical tools like BLAS and FFT have versions built on CUDA.
6. You've had a chance to program in all of them on parallel.ecse, with dual 14-core Xeons, Pascal NVidia board, and Xeon Phi coprocessor.
7. You've seen talks by leaders in high performance computing, such as Jack Dongarra.
8. You've seen quick references to parallel programming using Matlab, Mathematica, and the cloud.
9. Now, you can inductively reason towards general design rules for shared and non-shared parallel computers, and for the SW tools to exploit them.

## 2   Course survey

If you liked the course, then please officially tell RPI by completing the survey. Thanks.

## 3   Term project deliverables

### 3.1   Talk in class on Monday May 1

5 minutes per team, plus 2 minutes for questions.

### 3.2   Project report due by 11:59pm on Wed May 3

1. Upload it to LMS, or upload a note saying from where to retrieve it.
2. This should contain
1. an academic paper in the style of a good computing conference, or in the style of a paper published in a professional journal of the ACM or IEEE.
2. the code,
3. a video or user manual showing how to use it, and
4. a description of how it works, problems you solved, etc.

### 3.3   Optional demo to Yin Li

1. Schedule this with him.
2. If you want to do this, then spend time to make it good.
3. Try to schedule it by May 3 (although a little slippage might be ok).

### 3.4   Inspiration for finishing your term projects

1. The Underhanded C Contest

"The goal of the contest is to write code that is as readable, clear, innocent and straightforward as possible, and yet it must fail to perform at its apparent function. To be more specific, it should do something subtly evil. Every year, we will propose a challenge to coders to solve a simple data processing problem, but with covert malicious behavior. Examples include miscounting votes, shaving money from financial transactions, or leaking information to an eavesdropper. The main goal, however, is to write source code that easily passes visual inspection by other programmers."

2. The International Obfuscated C Code Contest

3. https://www.awesomestories.com/asset/view/Space-Race-American-Rocket-Failures

Moral: After early disasters, sometimes you can eventually get things to work.

4. The 'Wrong' Brothers Aviation's Failures (1920s)

5. Early U.S. rocket and space launch failures and explosions

6. Numerous US Launch Failures

## 4   Parallel computing videos

We'll see some subset of these.

### 4.1   Welcome - Distributed Systems in One Lesson

Niko Peikrishvili, 11 min

This is the first of several short videos.

### 4.2   Paying for Lunch: C++ in the ManyCore Age

CppCon 2014: by Herb Sutter

Published on Sep 29, 2014, 1h15m

http://www.cppcon.org

Presentation Slides, PDFs, Source Code and other presenter materials are available at: https://github.com/CppCon/CppCon2014

Concurrency is one of the major focuses of C++17 and one of the biggest challenges facing C++ programmers today. Hear what this panel of experts has to say about how to write concurrent C++ now and in the future.

MODERATOR: Herb Sutter - Author, chair of the ISO C++ committee, software architect at Microsoft.

SPEAKERS:

PABLO HALPERN - Pablo Halpern has been programming in C++ since 1989 and has been a member of the C++ Standards Committee since 2007. He is currently the Parallel Programming Languages Architect at Intel Corp., where he coordinates the efforts of teams working on Cilk Plus, TBB, OpenMP, and other parallelism languages, frameworks, and tools targeted to C++, C, and Fortran users. Pablo came to Intel from Cilk Arts, Inc., which was acquired by Intel in 2009. During his time at Cilk Arts, he co-authored the paper "Reducers and other Cilk++ Hyperobjects", which won best paper at the SPAA 2009 conference. His current work is focused on creating simpler and more powerful parallel programming languages and tools for Intel's customers and promoting adoption of parallel constructs into the C++ and C standards. He lives with his family in southern New Hampshire, USA. When not working on parallel programming, he enjoys studying the viola, skiing, snowboarding, and watching opera. Twitter handle: @PabloGHalpern

JARED HOBEROCK - Jared Hoberock is a research scientist at NVIDIA where he develops the Thrust parallel algorithms library and edits the Technical Specification on Extensions for Parallelism for C++. Website: http://github.com/jaredhoberock

ARTUR LAKSBERG - Artur Laksberg leads the Visual C++ Libraries development team at Microsoft. His interests include concurrency, programming language and library design, and modern C++. Artur is one of the co-authors of the Parallel STL proposal; his team is now working on the prototype implementation of the proposal.

ADE MILLER - Ade Miller writes C++ for fun. He wrote his first N-body model in BASIC on an 8-bit microcomputer 30 years ago and never really looked back. He started using C++ in the early 90s. Recently, he's written two books on parallel programming with C++: "C++ AMP: Accelerated Massive Parallelism with Microsoft Visual C++" and "Parallel Programming with Microsoft Visual C++". Ade spends the long winters in Washington contributing to the open source C++ AMP Algorithms Library as well as a few other projects. His summers are mostly spent crashing expensive bicycles into trees. Website: http://www.ademiller.com/blogs/tech/ Twitter handle: @ademiller

GOR NISHANOV - Gor Nishanov is a Principal Software Design Engineer on the Microsoft C++ team. He works on the 'await' feature. Prior to joining the C++ team, Gor worked on distributed systems in the Windows Clustering team.

MICHAEL WONG - You can talk to me about anything including C++ (even C and that language that shall remain nameless but starts with F), Transactional Memory, Parallel Programming, OpenMP, astrophysics (where my degree came from), tennis (still trying to see if I can play for a living), travel, and the best food (which I am on a permanent quest to eat). Michael Wong is the CEO of OpenMP. He is the IBM and Canadian representative to the C++ Standard and OpenMP Committee. And did I forget to say he is a Director of ISOCPP.org and a VP, Vice-Chair of Programming Languages for Canada's Standard Council. He has so many titles, it's a wonder he can get anything done. Oh, and he chairs the WG21 SG5 Transactional Memory, and is the co-author of a number of C++11/OpenMP/TM features including generalized attributes, user-defined literals, inheriting constructors, weakly ordered memory models, and explicit conversion operators. Having been the past C++ team lead for IBM's XL C++ compiler means he has been messing around with designing C++ compilers for twenty years. His current research interest, i.e., what he would like to do if he had time, is in the area of parallel programming, transactional memory, C++ benchmark performance, object model, generic programming and template metaprogramming. He holds a B.Sc. from the University of Toronto, and a Master's in Mathematics from the University of Waterloo. He has been asked to speak at ACCU, C++Now, Meeting C++, CASCON, and many universities, research centers and companies, except his own, where he has to listen. Now he and his wife love to teach their two children to be curious about everything.

### 4.3   Combine Lambdas and weak_ptrs to make concurrency easy

CppCon 2016: Dan Higgins, 4min

### 4.4   A Pragmatic Introduction to Multicore Synchronization

by Samy Al Bahra.

Published on Jun 15, 2016, 1h 2m.

This talk will introduce attendees to the challenges involved in achieving high performance multicore synchronization. The tour will begin with fundamental scalability bottlenecks in multicore systems and memory models, and then extend to advanced synchronization techniques involving scalable locking and lock-less synchronization. Expect plenty of hacks and real-world war stories in the fight for vertical scalability. Some of the topics introduced include memory coherence and consistency, memory organization, scalable locking, biased asymmetric synchronization, non-blocking synchronization and safe memory reclamation.

### 4.5   Synchronization - Blocking & Non-Blocking (1/2)

by Petr Kuznetsov, 15min

### 4.6   Lock-Free Programming (or, Juggling Razor Blades), Part I

CppCon 2014, by Herb Sutter

Published on Oct 16, 2014

http://www.cppcon.org

Presentation Slides, PDFs, Source Code and other presenter materials are available at: https://github.com/CppCon/CppCon2014

Example-driven talk on how to design and write lock-free algorithms and data structures using C++ atomic -- something that can look deceptively simple, but contains very deep topics. (Important note: This is not the same as my "atomic Weapons" talk; that talk was about the "what they are and why" of the C++ memory model and atomics, and did not cover how to actually use atomics to implement highly concurrent algorithms and data structures.)

# PAR Lecture 26, Mon Apr 24

## 1   Green Island bridge collapse 1977-03-16

(Not related to the course.) The importance of good engineering: story and image.

## 2   Debugging videos

### 2.1   uftrace: A function graph tracer for C/C++ userspace programs

6min

Published on Oct 7, 2016

http://CppCon.org

Presentation Slides, PDFs, Source Code and other presenter materials are available at: https://github.com/cppcon/cppcon2016

This is one of the CppCon 2016 Lightning Talks.

### 2.2   Concepts in 5

CppCon 2016: David Sankel

## 3   Programming videos

### 3.1   Modern C++: What You Need to Know

by Paulo Portela

Published on Apr 7, 2014, 1h

This talk will give an update on recent progress and near-future directions for C++, both at Microsoft and across the industry. This is a great introduction to the current state of the language, including a glimpse into the future of general purpose, performance-intensive, power-friendly, powerful native programming.

## 4   Parallel computing videos

### 4.1   Paying for Lunch: C++ in the ManyCore Age

CppCon 2014: by Herb Sutter

Published on Sep 29, 2014, 1h15m

http://www.cppcon.org

Presentation Slides, PDFs, Source Code and other presenter materials are available at: https://github.com/CppCon/CppCon2014

Concurrency is one of the major focuses of C++17 and one of the biggest challenges facing C++ programmers today. Hear what this panel of experts has to say about how to write concurrent C++ now and in the future.

MODERATOR: Herb Sutter - Author, chair of the ISO C++ committee, software architect at Microsoft.

SPEAKERS: Pablo Halpern (Intel), Jared Hoberock (NVIDIA), Artur Laksberg (Microsoft), Ade Miller, Gor Nishanov (Microsoft), and Michael Wong (OpenMP/IBM). Full bios appear under the identical listing in Lecture 27 above.

### 4.2   Combine Lambdas and weak_ptrs to make concurrency easy

CppCon 2016: Dan Higgins, 4min

### 4.3   A Pragmatic Introduction to Multicore Synchronization

by Samy Al Bahra.

Published on Jun 15, 2016, 1h 2m.

This talk will introduce attendees to the challenges involved in achieving high performance multicore synchronization. The tour will begin with fundamental scalability bottlenecks in multicore systems and memory models, and then extend to advanced synchronization techniques involving scalable locking and lock-less synchronization. Expect plenty of hacks and real-world war stories in the fight for vertical scalability. Some of the topics introduced include memory coherence and consistency, memory organization, scalable locking, biased asymmetric synchronization, non-blocking synchronization and safe memory reclamation.

### 4.4   Synchronization - Blocking & Non-Blocking (1/2)

by Petr Kuznetsov, 15min

### 4.5   Lock-Free Programming (or, Juggling Razor Blades), Part I

CppCon 2014, by Herb Sutter

Published on Oct 16, 2014

http://www.cppcon.org

Presentation Slides, PDFs, Source Code and other presenter materials are available at: https://github.com/CppCon/CppCon2014

Example-driven talk on how to design and write lock-free algorithms and data structures using C++ atomic -- something that can look deceptively simple, but contains very deep topics. (Important note: This is not the same as my "atomic Weapons" talk; that talk was about the "what they are and why" of the C++ memory model and atomics, and did not cover how to actually use atomics to implement highly concurrent algorithms and data structures.)

# PAR Lecture 25, Thu Apr 20

Today we'll see some Nvidia videos from recent conferences, then several cloud computing tools, then some more Intel example programs.

## 2   Cloud computing

The material is from Wikipedia, which seemed better than any other source that I could find.

1. Hierarchy:

1. IaaS (Infrastructure as a Service)
1. Sample functionality: VM, storage
2. Examples:
1. Amazon_Web_Services
2. OpenStack : compute, storage, networking, dashboard
2. PaaS (Platform ...)
1. Sample functionality: OS, Web server, database server
2. Examples:
1. OpenShift
2. Cloud_Foundry
3. Hadoop
1. distributed FS, Map Reduce
2. derived from Google FS, map reduce
3. Now, people often run Apache Spark™ - Lightning-Fast Cluster Computing instead of Hadoop, because Spark is faster.
3. SaaS (Software ...)
1. Sample functionality: email, gaming, CRM, ERP
2. Cloud_computing_comparison

3. Virtual machine.

The big question is: at what level does the virtualization occur? Do you duplicate the whole file system and OS, even emulate the HW, or just try to isolate files and processes within the same OS?

4. Distributed storage

1. VNC
2. Grid_computing
1. decentralized, heterogeneous
2. used for major projects like protein folding

## 3   More Intel tutorials

Today, following on to Lecture 23, we'll run some more Intel tutorials. Students with their laptops are encouraged to participate.

Setting things up in your home dir (or wherever):

1. mkdir samples; cd samples
2. source /parallel-class/mic/setup

For each sample,

1. Get the tarball into your dir with wget. (I copied some tarballs into /opt/intel/samples/ ).
2. Unpack it.
3. cd into the subdir

Here are the samples.

# PAR Lecture 24, Mon Apr 17

Today we'll see some programming tips and then several parallel computing tools.

## 1   Software tips

### 1.1   Freeze decisions early: SW design paradigm

One of my rules is to push design decisions to take effect as early in the process execution as possible. Constructing variables at compile time is best, at function call time is 2nd, and on the heap is worst.

1. If I have to construct variables on the heap, I construct few and large variables, never many small ones.

2. Often I compile the max dataset size into the program, which permits constructing the arrays at compile time. Recompiling for a larger dataset is quick (unless you're using CUDA).

Accessing this type of variable uses one less level of pointer than accessing a variable on the heap. I don't know whether this is faster with a good optimizing compiler, but it's probably not slower.

3. If the data will require a dataset with unpredictably sized components, such as a ragged array, then I may do the following.

1. Read the data once to accumulate the necessary statistics.
2. Construct the required ragged array.
3. Reread the data and populate the array.

### 1.2   X over ssh

X over ssh is very slow.

Here are some things I've discovered that help.

1. Use a faster, albeit less secure, cipher:

ssh -c arcfour,blowfish-cbc -C  parallel.ecse.rpi.edu


(This does not work yet.)

2. Use xpra; here's an example:

1. On parallel.ecse:

xpra start :77; DISPLAY=:77 xeyes&


Don't everyone use 77; pick your own number in the range 20-99.

2. On the client, i.e., your machine:

xpra attach ssh:parallel.ecse.rpi.edu:77

3. I suspect that security is weak. When you start an xpra session, I suspect that anyone on parallel.ecse can display to it, that anyone with ssh access to parallel.ecse can try to attach to it, and that the 1st person wins.

3. Use nx, which needs a server, e.g., FreeNX.

## 2   Jack Dongarra videos

1. Sunway TaihuLight's strengths and weaknesses highlighted. 9 min. 8/21/2016.

This is the new fastest known machine on top500. A machine with many Intel Xeon Phi coprocessors is now 2nd, Nvidia K20 is 3rd, and some machine built by a company down the river is 4th. These last 3 machines have been at the top for a surprisingly long time.

2. An Overview of High Performance Computing and Challenges for the Future. 57min, 11/16/2016.

We saw the 1st 18 min in class.

## 3   More parallel tools

### 3.1   cuFFT Notes

1. GPU Computing with CUDA Lecture 8 - CUDA Libraries - CUFFT, PyCUDA from Christopher Cooper, BU
2. video #8 - CUDA 5.5 cuFFT FFTW API Support. 3 min.
3. cuFFT is inspired by FFTW (the fastest Fourier transform in the west), whose authors say it is as fast as commercial FFT packages.
4. I.e., sometimes commercial packages may be worth the money.
5. Although the FFT is taught for N a power of two, users often want to process other dataset sizes.
6. The problem is that the optimal recursion method, and the relevant coefficients, depend on the prime factors of N.
7. FFTW and cuFFT determine a good solution procedure for the particular N.
8. Since this computation takes time, they store the method in a plan.
9. You can then apply the plan to many datasets.
10. If you're going to be processing very many datasets, you can tell FFTW or cuFFT to perform sample timing experiments on your system, to help in devising the best plan.
11. That's a nice strategy that some other numerical SW uses.
12. One example is Automatically Tuned Linear Algebra Software (ATLAS).

### 3.2   cuBLAS etc Notes

1. BLAS is an API for a set of simple matrix and vector functions, such as multiplying a vector by a matrix.
2. These functions' efficiency is important since they are the basis for widely used numerical applications.
3. Indeed you usually don't call BLAS functions directly, but use higher-level packages like LAPACK that use BLAS.
4. There are many implementations, free and commercial, of BLAS.
5. cuBLAS is one.
6. One reason that Fortran is still used is that, in the past, it was easier to write efficient Fortran programs than C or C++ programs for these applications.
7. There are other, very efficient, C++ numerical packages. (I can list some, if there's interest).
8. Their efficiency often comes from aggressively using C++ templates.
9. Matrix mult example

### 3.3   Matlab

1. Good for applications that look like matrices.

Considerable contortions required for, e.g., a general graph. You'd represent that with a large sparse adjacency matrix.

2. Using explicit for loops is slow.

3. Efficient execution when using builtin matrix functions,

but it can be difficult to write your algorithm that way.

4. Very expensive and getting more so.

Many separately priced apps.

5. Uses state-of-the-art numerical algorithms.

E.g., to solve large sparse overdetermined linear systems.

Better than Mathematica.

6. Most or all such algorithms are also freely available as C++ libraries.

However, which library to use?

Complicated calling sequences.

Obscure C++ template error messages.

7. Graphical output is mediocre.

Mathematica is better.

8. Various ways Matlab can execute in parallel

1. Operations on arrays can execute in parallel.

E.g. B=SIN(A) where A is a matrix.

2. Automatic multithreading by some functions

Various functions, like INV(a), automatically use perhaps 8 cores.

The '8' is a license limitation.

Which MATLAB functions benefit from multithreaded computation?

3. PARFOR

However, FOR is slow.

Many restrictions, e.g., cannot be nested.

Start pools first with: MATLABPOOL OPEN 12

Can do reductions.

4. Parallel Computing Server

This runs on a parallel machine, including Amazon EC2.

Your client sends batch or interactive jobs to it.

Many Matlab toolboxes are not licensed to use it.

This makes it much less useful.

5. GPU computing

Create an array on device with gpuArray

Run builtin functions on it.

See Matlab's documentation on running built-in functions on a GPU.

### 3.4   Mathematica in parallel

You terminate an input command with shift-enter.

Some Mathematica commands:

Sin[1.]
Plot[Sin[x],{x,-2,2}]
a=Import["/opt/parallel/mathematica/mtn1.dat"]
Information[a]
Length[a]
b=ArrayReshape[a,{400,400}]
MatrixPlot[b]
ReliefPlot[b]
ReliefPlot[MedianFilter[b,1]]
Dimensions[b]
Eigenvalues[b]    (* when you get bored waiting, type alt-. *)
Eigenvalues[b+0.0]
Table[ {x^i y^j,x^j y^i},{i,2},{j,2}]
Flatten[Table[ {x^i y^j,x^j y^i},{i,2},{j,2}],1]
StreamPlot[{x*y,x+y},{x,-3,3},{y,-3,3}]
$ProcessorCount
$ProcessorType
(* select "Parallel Kernel Configuration" and "Status" in the Evaluation menu *)
ParallelEvaluate[$ProcessID]
PrimeQ[101]
Parallelize[Table[PrimeQ[n!+1],{n,400,500}]]
merQ[n_]:=PrimeQ[2^n-1]
Select[Range[5000],merQ]
ParallelSum[Sin[x+0.],{x,0,100000000}]
Parallelize[ Select[Range[5000],merQ]]
Needs["CUDALink`"]    (* note the back quote *)
CUDAInformation[]
Manipulate[n, {n, 1.1, 20.}]
Plot[Sin[x], {x, 1., 20.}]
Manipulate[Plot[Sin[x], {x, 1., n}], {n, 1.1, 20.}]
Integrate[Sin[x]^3, x]
Manipulate[Integrate[Sin[x]^n, x], {n, 0, 20}]
Manipulate[{n, FactorInteger[n]}, {n, 1, 100, 1}]
Manipulate[Plot[Sin[a x] + Sin[b x], {x, 0, 10}], {a, 1, 4}, {b, 1, 4}]

Unfortunately there's a problem that I'm still debugging with the Mathematica-CUDA interface.

# PAR Lecture 23, Thu Apr 13

## 1   Git

Git is good for simultaneously keeping various versions.

A git intro:

1. Create a dir for the project: mkdir PROJECT; cd PROJECT
2. Initialize: git init
3. Create a branch (you can do this several times): git branch MYBRANCHNAME
4. Go to a branch: git checkout MYBRANCHNAME
5. Do things: vi, make, ....
6. Save it: git add .; git commit -mCOMMENT
7. Repeat.

We'll test this on /parallel-class/mic/matmul3.cc for the Xeon and the Xeon Phi.

## 2   Intel tutorials

Today we'll run some Intel tutorials. Students with their laptops are encouraged to participate.

Setting things up in your home dir (or wherever):

1. mkdir samples; cd samples
2. source /parallel-class/mic/setup

For each sample,

1. Get the tarball into your dir with wget.
2. Unpack it.
3. cd into the subdir.
4. Browse readme.html and follow it.

Here are the samples.

1. Intel® C++ Compiler - Using Auto-Vectorization Tutorial
2. Intel® C++ Xeon Phi™ sample
1. LEO_tutorial
2. Sample_c: several programs.
3. Tachyon samples with various backends.

# PAR Lecture 22, Mon Apr 10

## 1   Intel and OpenMP Videos

# PAR Lecture 21, Thurs Apr 6

## 1   OpenMP on the mic

1. OpenMP is now running on the mic (Xeon Phi).
2. Setup envars thus (assuming you're using bash or zsh):

source /opt/intel/bin/iccvars.sh -arch intel64
export SINK_LD_LIBRARY_PATH=/opt/intel/lib/mic

3. To compile on parallel for running on parallel:

icpc -fopenmp sum_reduc2.cc -o sum_reduc2

4. Run it on parallel thus:

./sum_reduc2

5. To cross compile on parallel for running on mic0:

icpc -mmic -fopenmp sum_reduc2.cc -o sum_reduc2-mic

6. Run it natively on mic0 thus:

micnativeloadex sum_reduc2-mic

7. It is also possible to have an OpenMP program on parallel.ecse execute parallel threads on mic0:

#pragma offload target (mic)
{
#pragma omp parallel
{
...
}
}

See /parallel-class/mic/stackoverflow/hello.c

8. Compile it thus:

icc -fopenmp hello.c -o hello

Note that there is no -mmic.

9. Run it thus:

./hello

10. /parallel-class/mic/stackoverflow/hello_omp.c shows a different syntax.

11. Degrees of parallelism: See slide 12 of https://software.intel.com/sites/default/files/managed/87/c3/using-nested-parallelism-in-openMP-r1.pdf

# PAR Homework 7

Hand in your solution on RPILMS, unless instructions say otherwise.

Each team should submit its solution under only 1 student's name. The other student's submission should just name the lead student. (This makes it easier for us to avoid grading it twice.)

If you have problems, then ask for help. The goal is to learn the material.

## Homework 7, due Mon 2017-04-10

1. I forgot to say to submit a written project proposal (in addition to giving the talk). So, if you haven't, submit one.

2. Read Rebooting Computing: The Road Ahead in the Jan 2017 IEEE Computer. Write 100 words on how they propose to get more computing done in the near future.

3. Play with your earlier programs with icpc and on the MIC and report any observations.

# PAR Lecture 20, Mon Apr 3

## 1   Intel compilers on parallel

1. Intel Parallel Studio XE Cluster 2017 is now installed on parallel, in /opt/intel/. It is a large package with compilers, debuggers, analyzers, MPI, etc.
There is extensive doc on Intel's web site. Have fun.

Students and free SW developers can also get free licenses for their machines. Commercial licenses cost $thousands.

2. What’s Inside Intel Parallel Studio XE 2017. There are PDF slides, a webinar, and training material.

3. https://goparallel.sourceforge.net/want-faster-code-faster-check-parallel-studio-xe/

4. Before using the compiler, you should initialize some envars thus:

source /opt/intel/bin/iccvars.sh -arch intel64

5. Then you can compile a C or C++ program thus:

icc -openmp -O3 foo.c -o foo
icpc -qopenmp -O3 foo.cc -o foo

6. On my simple tests, not using the mic, icpc and g++ produced equally good code.

7. Compile a C++ program with OpenMP thus:

icpc -qopenmp -std=c++11    omp_hello.cc   -o omp_hello

8. Test it thus (it is in /parallel-class/mic):

OMP_NUM_THREADS=4 ./omp_hello


Note how the output from the various threads is mixed up.

## 2   Programming the MIC (ctd)

1. It turns out that I (but not you) can update a login password on mic0, but it's a little tedious. Use your ssh key.

Details: at startup, mic0:/etc is initialized from parallel:/var/mpss/mic0/etc. So I could edit shadow and insert a new encrypted password.

So is /home, but it's then replaced by the NFS mount.

2. I can also change, e.g., your login shell. Use bash on mic0 since zsh does not exist there.

3. Dr. Dobb's: Programming the Xeon Phi by Rob Farber.

4. Some MIC demos .

5. To cross compile with icc and icpc, see the MPSS users guide, Section 8.1.4. Use the -mmic flag.

6. Intel OpenMP* Support Overview.

8. Rolling Out the New Intel Xeon Phi Processor at ISC 2016 https://www.youtube.com/watch?v=HDPYymREyV8

9. Supermicro Showcases Intel Xeon Phi and Nvidia P100 Solutions at ISC 2016 https://www.youtube.com/watch?v=nVWqSjt6hX4

10. Insider Look: New Intel® Xeon Phi™ processor on the Cray® XC™ Supercomputer https://www.youtube.com/watch?v=lkf3U_5QG_4

# PAR Lecture 19, Thu Mar 30

## 1   Intel Xeon Phi 7120A

### 1.1   In general

1. The Xeon Phi is Intel's brand name for their MIC (for Many Integrated Core Architecture).

2. The 7120a is Intel's Knights Corner (1st generation) MIC architecture, launched in 2014.

3. It has 61 cores, each running 4 hardware threads. Having several threads per core helps to overcome latency in fetching stuff.

4. It has 16GB of memory accessible at 352 GB/s, 30MB of L2 cache, and peaks at 1TFlops double precision.

5. It is a coprocessor on a card accessible from a host CPU on a local network.

6. It is intended as a supercomputing competitor to Nvidia.

7. The mic architecture is quite similar to the Xeon.

8. However, executables from one don't run on the other, unless the source was compiled to include both versions in the executable file.

9. The mic has been tuned to emphasize floating-point performance at the expense of, e.g., speculative execution.

This helps to make it competitive with Nvidia, even though Nvidia GPUs can have many more cores.

10. Its OS is an embedded Linux with a BusyBox userland.

11. The SW is called MPSS (Manycore Platform Software Stack).

12. The mic can be integrated with the host in various ways that I haven't (yet) implemented.

1. Processes on the host can execute subprocesses on the device, as happens with Nvidia CUDA.
2. E.g., OpenMP on the host can run parallel threads on the mic.
3. The mic can page virtual memory from the host.
13. The fastest machine on top500.org a few years ago used Xeon Phi cards.

The 2nd used Nvidia K20 cards, and the 3rd fastest was an IBM Blue Gene.

So, my course lets you use the 2 fastest architectures, and there's another course available at RPI for the 3rd.

14. Information:

### 1.2   parallel.ecse's mic

1. The hostname (of this particular MIC) is parallel-mic0 or mic0.

2. The local filesystem is in RAM and is reinitialized when mic0 is rebooted.

3. Parallel:/home and /parallel-class are NFS exported to mic0.

4. /home can be used to move files back and forth.

5. All the user accounts on parallel were given accounts on mic0.

6. You can ssh to mic0 from parallel.

7. Your current parallel ssh key pair should work.

(The mic0 setup snapshotted parallel's accounts and created a read-only image to boot mic0 from. Any changes to mic0:/etc/shadow are reverted when mic0 reboots.)

### 1.3   Programming the mic

1. Parallel:/parallel-class/mic/bin has versions of gcc, g++, etc., with names like k1om-mpss-linux-g++.

2. These run on parallel and produce executable files that run (only) on mic0.

3. Here's an example of compiling (on parallel) a C program in /parallel-class/mic

bin/k1om-mpss-linux-gcc hello.c -o hello-mic

4. Run it thus from parallel (it runs on mic0):

ssh mic0  /parallel-class/mic/hello-mic


# PAR Lecture 18, Mon Mar 27

## 1   Thrust

### 1.1   Examples

1. I rewrote /parallel-class/thrust/examples-1.8/tiled_range.cu into /parallel-class/thrust/rpi/tiled_range2.cu.

It is now much shorter and much clearer. All the work is done here:

gather(make_transform_iterator(make_counting_iterator(0), _1%N), make_transform_iterator(make_counting_iterator(N*C), _1%N), data.begin(), V.begin());

1. make_counting_iterator(0) returns pointers to the sequence 0, 1, 2, ...
2. _1%N is a function computing modulo N.
3. make_transform_iterator(make_counting_iterator(0), _1%N) returns pointers to the sequence 0%N, 1%N, ...
4. gather populates V: the i-th element of V gets the element of data indexed by the i-th value of the transform iterator, i.e., the (i%N)-th element of data.
2. tiled_range3.cu is even shorter. Instead of writing an output vector, it constructs an iterator for a virtual output vector:

auto output=make_permutation_iterator(data, make_transform_iterator(make_counting_iterator(0), _1%N));

1. *(output+i) is *(data+(i%N)).
2. You can get as many tiles as you want by iterating.
3. tiled_range3.cu also constructs an iterator for a virtual input vector (in this case a vector of squares) instead of storing the data:

auto data = make_transform_iterator(make_counting_iterator(0), _1*_1);

3. tiled_range5.cu shows how to use a lambda instead of the _1 notation:

auto output=make_permutation_iterator(data, make_transform_iterator(make_counting_iterator(0), [](const int i){return i%N;} ));

1. You have to compile with --std c++11 .

2. This can be rewritten thus:

auto f = [](const int i){return i%N;}; auto output = make_permutation_iterator(data, make_transform_iterator(make_counting_iterator(0), f));

3. The shortest lambda is this:

auto f = [](){};

4. repeated_range2.cu is my improvement on repeated_range.cu:

auto output=make_permutation_iterator(data.begin(), make_transform_iterator(make_counting_iterator(0), _1/3));

1. make_transform_iterator(make_counting_iterator(0), _1/3)) returns pointers to the sequence 0,0,0,1,1,1,2,2,2, ...
5. Unmodified thrust examples:

1. expand.cu takes a vector like V= [0, 10, 20, 30, 40] and a vector of repetition counts, like C= [2, 1, 0, 3, 1]. Expand repeats each element of V the appropriate number of times, giving [0, 0, 10, 30, 30, 30, 40]. The process is as follows.

1. Since the output vector will be longer than the input, the main program computes the output size, by using reduce to sum C, and constructs a vector to hold the output.
2. Exclusive_scan C to obtain output offsets for each input element: C2 = [0, 2, 3, 3, 6].
3. Scatter_if the input indices into their corresponding output positions. A counting iterator, [0, 1, 2, 3, 4], is scattered with the map C2, using C as the stencil, giving C3 = [0, 0, 1, 3, 0, 0, 4].
4. An inclusive_scan with max fills in the holes in C3, to give C4 = [0, 0, 1, 3, 3, 3, 4].
5. Gather uses C4 to gather elements of V: [0, 0, 10, 30, 30, 30, 40].
2. set_operations.cu. This shows methods of handling an operation whose output is of unpredictable size. The question is, is space or time more important?

1. If the maximum possible output size is reasonable, then construct an output vector of that size, use it, and then erase it down to its actual size.

2. Or, run the operation twice. The 1st time, write to a discard_iterator, and remember only the size of the written data. Then, construct an output vector of exactly the right size, and run the operation again.

I use this technique a lot with ragged arrays in sequential programs.

3. sparse_vector.cu represents and sums sparse vectors.

1. A sparse vector has mostly 0s.

2. The representation is a vector of element indices and another vector of values.

3. Adding two sparse vectors goes as follows.

1. Allocate temporary index and element vectors of the max possible size (the sum of the sizes of the two inputs).

2. Catenate the input vectors.

3. Sort by index.

4. Find the number of unique indices by applying inner_product with addition and not-equal-to-next-element to the indices, then adding one.

E.g., applied to these indices: [0, 3, 3, 4, 5, 5, 5, 8], it gives 5.

5. Allocate exactly enough space for the output.

6. Apply reduce_by_key to the indices and elements to add elements with the same keys.

The size of the output is the number of unique keys.

6. What's the best way to sort 16000 sets of 1000 numbers each? E.g., sort the rows of a 16000x1000 array? On geoxeon, /pc/thrust/rpi/comparesorts.cu, which I copied from stackoverflow (http://stackoverflow.com/questions/28150098/how-to-use-thrust-to-sort-the-rows-of-a-matrix), compares three methods.

7. Call the thrust sort 16000 times, once per set. That took 10 secs.

8. Sort the whole list of 16,000,000 numbers together. Then sort it again by key, with the keys being the set number, to bring the elements of each set together. Since the sort is stable, this maintains the order within each set. (This is also how radix sort works.) That took 0.04 secs.

9. Call a thrust function (to sort a set) within another thrust function (that applies to each set). This is new in Thrust 1.8. That took 0.3 secs.

This is a surprising and useful paradigm. It works because

1. There's an overhead to starting each thrust function, and
2. Radix sort, which thrust uses for ints, takes linear time.

### 1.2   Backends

1. The Thrust device can be CUDA, OpenMP, TBB, etc.

2. You can specify it in 2 ways:

1. by adding an extra execution-policy argument at the start of a function's argument list.

2. with an environment variable.

https://github.com/thrust/thrust/wiki/Host-Backends

https://github.com/thrust/thrust/wiki/Device-Backends

# PAR Lecture 17, Thu Mar 23

## 1   Term project proposal talks

Each term project team gives a 5-minute proposal talk in class this Thurs. I'll let you choose the order.

## 2   Linux programming tip

I like to use emacs on my local machine to edit files on parallel.ecse. It's more responsive than running a remote editor, and lets me copy to and from local files. Here are 2 ways to do that.

1. In emacs on my local machine, I do find-file thus:

/parallel.ecse.rpi.edu:/parallel-class/thrust/rpi/tiled_range2.cu

Emacs transparently runs scp to read and write the remote file.

2. I can mount remote directories on my local machine with something like ssh-fs and then edit remote files as if they were local.

In either case, I have to compile the files on parallel.

# PAR Lecture 16, Mon Mar 20

## 1   Term project proposal talk on Thurs

Remember that, as part of homework 6, each term project team is required to give a 5-minute proposal talk in class this Thurs. I'll let you choose the order.

## 2   OpenMP: CPU time vs wall clock (elapsed) time

On parallel.ecse, /parallel-class/openmp/rpi/cpuvs.wall.cc is a simple program that shows how wall-clock time shrinks as the number of threads rises (up to a point) but CPU time rises from the start.

That is because one way that a thread waits to run is to burn cycles. For small delays, this is easier and has less overhead than explicitly suspending the thread. The GNU version of OpenMP lets you choose the crossover point between burning cycles and suspending a thread.

## 3   Rebooting parallel.ecse

As part of getting the Xeon Phi coprocessor working, I'll need to reboot parallel from time to time. That takes a few minutes and you're welcome to connect again when it comes up.

I have to do this when I'm physically at the machine.

I'll try to do this when it's not actively being used, and will try to send a message to connected people.

## 4   Thrust

The intellectual part of today's lecture will continue Thrust, following my Thrust notes in lecture 14.

We'll also look at /parallel-class/thrust/doc/An_Introduction_To_Thrust.pdf and GTC_2010_Part_2_Thrust_By_Example.pdf in more detail.

# PAR Homework 6

Hand in your solution on RPILMS, unless instructions say otherwise. Each team should submit its solution under only 1 student's name. The other student's submission should just name the lead student. (This makes it easier for us to avoid grading it twice.)

If you have problems, then ask for help. The goal is to learn the material.

## Homework 6, due Thu 2017-03-23 in class

1. Form teams of up to 3 students and decide on term project topics.
1. Any (legal, ethical) topic that shows that you learned something in class is ok.
2. Combining with other courses and research projects is ok (if the other side agrees).
2. Each team is encouraged to create a blog for the deliverables, like the proposal.
1. I don't care if it's password-protected or not, if I can read it.
2. If you want to do this but don't have a platform, I can create a PmWiki site for you.
3. Each team gives a 5-minute talk to the class, plus 1 or 2 minutes for questions.
1. You choose the medium.
2. It does not have to be a fast-forward powerpoint presentation.
3. However you are responsible for getting your presentation working; this time comes from your 5 minutes.
4. So that the class ends on time, I will cut you off if you run too long.

# PAR Lecture 14, Mon Mar 6

## 1   Web site

1. I added Disqus comments. This is a way for you to ask questions and discuss things with other class members.
2. I revised the tag categories. Now homeworks and lectures can be listed separately.

## 2   parallel.ecse hardware details

I put the invoice on parallel.ecse in /parallel-class/ . It gives the hardware specifics.

## 3   Term project

Summary in syllabus.

## 4   Student presentations about GTC

1. JA, Shaping Light with GPUs
2. LS, High-Level GPU Programming Using OpenMP 4.5 and Clang/LLVM
3. SKL, GPU-Accelerated Motion Input Analysis for Motion Reconstruction
4. WB
5. BC
6. FC
7. AF
8. AG
9. BL
10. TL

## 5   Nvidia GPU summary

Here's a summary of the Nvidia Pascal GP104 GPU architecture as I understand it. It's more compact than I've found elsewhere. I'll add to it from time to time. Some numbers are probably wrong.

1. The host is the CPU.

2. The device is the GPU.

3. The device contains 20 streaming multiprocessors (SMs).

Different GPU generations have used the terms SMX or SMM.

4. A thread is a sequential program with private and shared memory, program counter, etc.

5. Threads are grouped, 32 at a time, into warps.

6. Warps of threads are grouped into blocks.

Often the warps are only implicit, and we consider that the threads are grouped directly into blocks.

That abstraction hides details that may be important; see below.

7. Blocks of threads are grouped into a grid, which is all the threads in the kernel.

8. A kernel is a parallel program executing on the device.

1. The kernel runs potentially thousands of threads.
2. A kernel can create other kernels and wait for their completion.
3. There may be a limit, e.g., 5 seconds, on a kernel's run time.

9. Thread-level resources:

1. Each thread can use up to 255 fast registers. Registers are private to the thread.

All the threads in one block have their registers allocated from a fixed pool of 65536 registers. The more registers that each thread uses, the fewer warps in the block can run simultaneously.

2. Each thread has 512KB slow local memory, allocated from the global memory.

3. Local memory is used when not enough registers are available, and to store thread-local arrays.

10. Warp-level resources:

1. Threads are grouped, 32 at a time, into warps.

2. Each warp executes as a SIMD, with one instruction register. At each cycle, every thread in a warp is either executing the same instruction, or is disabled. If the 32 threads want to execute 32 different instructions, then they will execute one after the other, sequentially.

If you read in some NVidia doc that threads in a warp run independently, then continue reading the next page to get the info mentioned in the previous paragraph.

3. If successive instructions in a warp do not depend on each other, then, if there are enough warp schedulers available, they may be executed in parallel. This is called Instruction Level Parallelism (ILP).

4. For an array in local memory, which means that each thread will have its private copy, the elements for all the threads in a warp are interleaved to potentially increase the I/O rate.

5. A thread can read variables from other threads in the same warp, with the shuffle instruction. Typical operations are to read from the K-th next thread, to do a butterfly permutation, or to do an indexed read. This happens in parallel for the whole warp, and does not use shared memory.

6. A warp vote combines a bit computed by each thread to report results like all or any.

11. Block-level resources:

1. A block may contain up to 1024 threads.

3. Each block can use up to 49152 bytes of the SM's fast shared memory. The block's shared memory is shared by all the threads in the block, but is hidden from other blocks.

Shared memory is basically a user-controllable cache of some global data. The saving comes from reusing that shared data several times after you loaded it from global memory once.

Shared memory is interleaved in banks so that some access patterns are faster than others.

4. Warps in a block run asynchronously and run different instructions. They are scheduled and executed as resources are available.

Because of how warps are scheduled, that can be slow.

6. The threads in a block can be arranged into a 3D array, up to 1024x1024x64.

That is for convenience, and does not increase performance (I think).

7. I'll talk about textures later.

12. Streaming Multiprocessor (SM) - level resources:

1. Each SM has 128 single-precision CUDA cores, 64 double-precision units, 32 special function units, and 32 load/store units.

2. In total, the GPU has 2560 CUDA cores.

3. A CUDA core is akin to an ALU. The cores, and all the units, are pipelined.

4. A CUDA core is much less powerful than one core of an Intel Xeon. My guess is 1/20th.

5. Beware that, in the CUDA C Programming Guide, NVidia sometimes calls an SM a core.

6. The limited number of, e.g., double-precision units means that a DP instruction will need to be scheduled several times for all the threads to execute it. That's why DP is slower.

7. Each SM has 4 warp schedulers and 8 instruction dispatch units.

8. 64 warps can simultaneously reside in an SM.

9. Therefore up to 32x64=2048 threads can be executed in parallel by an SM.

10. Up to 16 blocks can simultaneously be resident in an SM.

However, if each block uses too many resources, like shared memory, then this number is reduced.

Each block sits on only one SM; no block is split. However a block's warps are executed asynchronously (until synced).

11. Each SM has 64KiB (?) fast memory to be divided between shared memory and an L1 cache. Typically, 48KiB (96?) is used for the shared memory, to be divided among its resident blocks, but that can be changed.

12. The 48KB L1 cache can cache local or global memory.

13. Each SM has a read-only data cache of 48KB to cache the global constant memory.

14. Each SM has 8 texture units, and many other graphics capabilities.

15. Each SM has 256KB of L2 cache.

13. Grid-level resources:

1. The blocks in a grid can be arranged into a 3D array, up to $(2^{31}-1, 2^{16}-1, 2^{16}-1)$.
2. Blocks in a grid might run on different SMs.
3. Blocks in a grid are queued and executed as resources are available, in an unpredictable parallel or serial order. Therefore they should be independent of each other.
4. The number of instructions in a kernel is limited.
5. Any thread can stop the kernel by calling assert.
14. Device-level resources:

1. There is a large and slow 8GB global memory, which persists from kernel to kernel.

Transactions to global memory are 128 bytes.

Host memory can also be memory-mapped into global memory, although the I/O rate will be lower.

Reading from global memory can take hundreds of cycles. A warp that does this will be paused and another warp started. Such context switching is very efficient. Therefore device throughput stays high, although there is a latency. This is called Thread Level Parallelism (TLP) and is a major reason for GPU performance.

That assumes that an SM has enough active warps that there is always another warp available for execution. That is a reason for having warps that do not use all the resources (registers etc) that they're allowed to.

2. There is a 2MB L2 cache, for sharing data between SMs.

3. There is a 64KiB small and fast global constant memory, which also persists from kernel to kernel. It is implemented as a piece of the global memory, made fast with caches.

(Again, I'm still resolving this apparent contradiction).

4. The Grid Management Unit (GMU) schedules (pauses, executes, etc.) grids on the device. This matters more now that grids can start other grids (Dynamic Parallelism).

5. Hyper-Q: 32 simultaneous CPU tasks can launch kernels into the queue; they don't block each other. If one kernel is waiting, another runs.

6. CUDA Work Distributor (CWD) dispatches 32 active grids at a time to the SMs. There may be 1000s of grids queued and waiting.

7. GPU Direct: Other devices can DMA the GPU memory.

8. The base clock is 1607MHz.

9. GFLOPS: 8873.

10. Memory bandwidth: 320GB/s

15. GPU-level resources:

1. Being a Geforce product, there are many graphics facilities that we're not using.
2. There are 4 Graphics processing clusters (GPCs) to do graphics stuff.
3. Several perspective projections can be computed in parallel, for systems with several displays.
4. There's HW for texture processing.
16. Generational changes:

1. With each new version, Nvidia tweaks the numbers. Some get higher, others get lower.
1. E.g., Maxwell had little HW for double precision, and so that was slow.
2. Pascal's clock speed is much higher.
17. Refs:

1. The CUDA program deviceDrv.
4. Better Performance at Lower Occupancy, Vasily Volkov, UC Berkeley, 2010.
5. https://www.pgroup.com/lit/articles/insider/v2n1a5.htm - well written but old.

(I'll keep adding to this. Suggestions are welcome.)

## 6   More CUDA

1. CUDA function qualifiers:

1. __global__ device function called from host, starting a kernel.
2. __device__ device function called from device function.
3. __host__ (default) host function called from host function.
2. CUDA variable qualifiers:

1. __shared__
2. __device__ global
3. __device__ __managed__ automatically paged between host and device.
4. __constant__
5. (nothing) register if scalar, or local if array or if no more registers available.
3. If installing CUDA on your machine, this repository seems best:

That includes the Thrust headers but not example programs.

## 7   Thrust

1. Thrust is an API that looks like STL. Its backend can be CUDA, OpenMP, or sequential host-based code.

2. The online Thrust directory structure is a mess. Three main sites appear to be these:

1. The best way to install it is to clone from here.
3. The wiki has a lot of doc.
1. https://thrust.github.io/

This points to the above site.

2. https://developer.nvidia.com/thrust

This has links to other Nvidia docs, some of which are obsolete.

3. http://docs.nvidia.com/cuda/thrust/index.html

4. https://code.google.com/ - no longer exists.

4. Functional-programming philosophy.

5. Many possible backends: host, GPU, OpenMP, TBB...

6. Easier programming, once you get used to it.

7. Code is efficient.

9. Since the Stanford slides were created, Thrust has adopted unified addressing, so that pointers know whether they are host or device.

10. On parallel in /parallel-class/thrust/ are many little demo programs from the thrust distribution, with my additions.

11. CUDACast videos on Thrust:

CUDACast #15 - Introduction to Thrust

CUDACast #16 - Thrust Algorithms and Custom Operators

12. Thrust is fast because the functions that look like they would need linear time really take only log time in parallel.

13. In functions like reduce and transform, you often see an argument like thrust::multiplies<float>(). The syntax is as follows:

1. thrust::multiplies<float> is a class.
3. However, in the call to reduce, thrust::multiplies<float>() is calling the default constructor to construct a variable of class thrust::multiplies<float>, and passing it to reduce.
4. reduce will treat its argument as a function name and call it with an argument, triggering operator().
5. You may also create your own variable of that class, e.g., thrust::multiplies<float> foo. Then you just say foo in the argument list, not foo().
6. The optimizing compiler will replace the operator() function call with the defining expression and then continue optimizing. So, there is no overhead, unlike if you passed in a pointer to a function.
14. Sometimes, e.g., in saxpy.cu, you see saxpy_functor(A).

1. The class saxpy_functor has a constructor taking one argument.
2. saxpy_functor(A) constructs and returns a variable of class saxpy_functor and stores A in the variable.
3. The class also overloads operator().
4. (Let's call the new variable foo.) foo() calls operator() on foo; its execution uses the stored A.
5. Effectively, we did a closure of saxpy_functor; that is, we bound a property and returned a new, more restricted, variable or class.
15. The Thrust examples teach several non-intuitive paradigms. As I figure them out, I'll describe a few. My descriptions are modified and expanded versions of the comments in the programs. This is not a list of all the useful programs, but only of some where I am adding to their comments.

1. arbitrary_transformation.cu and dot_products_with_zip.cu show the very useful zip_iterator. Using it is a 2-step process.

1. Combine the separate iterators into a tuple.
2. Construct a zip iterator from the tuple.

Note that operator() is now a template.

2. boundingbox.cu finds the bounding box around a set of 2D points.

The main idea is to do a reduce. However, the combining operation, instead of addition, is to combine two bounding boxes to find the box around them.

3. bucket_sort2d.cu overlays a grid on a set of 2D points and finds the points in each grid cell (bucket).

1. The tuple is an efficient class for a short vector of fixed length.

2. Note how random numbers are generated. You combine an engine that produces random output with a distribution.

However you might need more complicated coding to make the numbers good when executing in parallel. See monte_carlo_disjoint_sequences.cu.

3. The problem is that the number of points in each cell is unpredictable.

4. The cell containing each point is computed and that and the points are sorted to bring together the points in each cell.

5. Then lower_bound and upper_bound are used to find each bucket in that sorted vector of points.

6. See this lower_bound description.

4. mode.cu shows:

1. Counting the number of unique keys in a vector.
1. Sort the vector.
2. Do an inner_product. However, instead of the operators being times and plus, they are not equal to the next element and plus.
2. Counting their multiplicity.
1. Construct vectors, sized at the number of unique keys, to hold the unique keys and counts.
2. Do a reduce_by_keys on a constant_iterator using the sorted vector as the keys. For each range of identical keys, it sums the constant_iterator. That is, it counts the number of identical keys.
3. Write a vector of unique keys and a vector of the counts.
3. Finding the most used key (the mode).
1. Do max_element on the counts vector.
5. repeated_range.cu repeats each element of an N-vector K times: repeated_range([0, 1, 2, 3], 2) -> [0, 0, 1, 1, 2, 2, 3, 3]. It's a lite version of expand.cu, but uses a different technique.

1. Here, N=4 and K=2.
2. The idea is to construct a new iterator, repeated_range, that, when read and incremented, will return the proper output elements.
3. The construction stores the relevant info in structure components of the variable.
4. Treating its value like a subscript in the range [0,N*K), it divides that value by K and returns that element of its input.

# PAR Lecture 15, Thurs Mar 9

## 1   Student presentations about GTC

1. DP-S, Fast Splittable Pseudorandom Number Generator
2. CF
3. MB, Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures
4. NB
5. TN
6. ET
7. ZX

## 2   Unionfs: Linux trick of the day

1. aka overlay FS, translucent FS.

2. If a, b are directories, and m is an empty directory, then

unionfs -o cow a=RW:b m

makes m a combination of a and b, with a having higher priority.

3. Writing a file into m writes it in a.

4. Changing a file in b writes the new version into a.

5. Deleting a file in b causes a white-out note to be stored in a.

6. Unmount it thus:

fusermount -u m

7. None of this requires superuser.

9. Note: IBM had a commercial version of this idea in its CP/CMS OS in the 1960s.

# PAR Homework 5


## Homework 5, due Fri 2017-03-10, 9am.

1. Rewrite the potential energy problem to be more efficient:
1. Impose a 100x100x100 grid on the data.
2. Sort the points into the 1M cells.
3. For each point:
1. Compute the potential of it against all the points in its cell and the 26 (or fewer) neighboring cells.
2. For farther cells, aggregate the points in each cell and represent them as one heavy point at the cell center.
2. Implement this 3 times, using:
1. OpenMP,
2. CUDA,
3. Thrust.
3. Report the times, on the bunny and blade, and on a new point set in /parallel-class/data/unc . It's 5423053 points from a powerplant, from UNC. The points are very unevenly spaced.
4. Report how different your answers are for the 3 implementations. Explain any differences.
5. Report the time and accuracy of this method compared to your original, brute force, method.

# PAR Lecture 13, Thu Mar 2

## 1   Thrust

Stanford lecture 8: Thrust. This is a functional frontend to various backends, such as CUDA.

Programs are on parallel in /parallel-class/thrust/ .

## 2   Term project

Time to start thinking. More details Monday. If you'd like to work on something related to my interests, great! This might turn into a paper.

# PAR Lecture 12, Mon Feb 27

## 1   Parallel.ecse programs

1. /parallel-class/cuda/checksum.cc shows a significant digits problem when you add many small numbers.

2. sum_reduction.cu is Stanford's program.

3. sum_reduction2.cu is my modification to use managed memory.

Note how both sum_reduction and sum_reduction2 give different answers for the serial and the parallel computation. That is bad.

4. sum_reduction3.cu is a mod to try to find the problem. One problem is insufficient precision in the sum. Using double works. However there might be other problems.

## 2   Stanford lectures

Continue with slide 29 of Lecture 6 parallel patterns 1, which presents some paradigms of parallel programming. These are generally useful building blocks for parallel algorithms.

Then do lecture_7.

# PAR Homework 4


## Homework 4, due Fri 2017-03-03, 9am.

### Presentation

Each person pick a topic from the 2016 GTC and give a 3 minute talk on it. We'll do 10 on Mon Mar 6 and the rest on Thurs Mar 9. Email Yin Li with your preferred date; first come, first served.

### Programming questions

1. Rewrite my CUDA matrix multiplication program, matmul2.cu, to tile the matrix into shared memory. How much faster is it?
2. Now do it with cuBLAS. (You have to learn enough cuBLAS to do this.)

# PAR Lecture 11, Thurs Feb 23

## 1   Parallel.ecse programs

/parallel-class/cuda/matmul2.cu plays with matrix multiplication of two random 1000x1000 matrices, and with managed memory.

1. It shows a really quick way to use OpenMP, which has a 10x speedup.
2. It shows a really quick way to use CUDA, which has a 15x speedup. This just uses one thread block per input matrix A row, and one thread per output element. That's 1M threads. The data is read from managed memory. Note how easy it is. There is no storing tiles into fast shared memory.
3. It multiplies the matrices on the host, and compares reading them from normal memory and from managed memory. The latter is 2.5x slower. Dunno why.
4. matmul2 also shows some of my utility functions and macros.
1. InitCUDA prints a description of the GPU etc.
2. cout << PRINTC(expr) prints an expression's name and then its value and a comma.
3. PRINTN is like PRINTC but ends with endl.
4. TIME(expr) evals an expression then prints its name and total and delta
1. CPU time,
2. elapsed time,
3. their ratio.
5. CT evals and prints the elapsed time of a CUDA kernel.
6. Later I may add new tests to this.

## 2   Computer factoid

Unrelated to this course, but perhaps interesting:

All compute servers have for decades had a microprocessor frontend that controls the boot process. The current iteration is called an IPMI. It has a separate ethernet port, and would allow remote BIOS configuration and booting. On parallel, that port is not connected (I don't trust the security).

## 3   CUDA Doc

The start of Nvidia's Parallel thread execution has useful info.

This is one of a batch of CUDA docs. Browse as you wish.

## 4   Stanford lectures

All the lectures and associated files are on parallel.ecse at /parallel-class/stanford/ .

They are also online here.

Lecture 6 parallel patterns 1 presents some paradigms of parallel programming. These are generally useful building blocks for parallel algorithms.

# PAR Lecture 10, Tues Feb 21

## 1   Vim hint

To get vim to show line numbers, create a file ~/.exrc containing this line:

:se nu

It will be read every time vim starts, and will set line-number mode.

## 2   CUDA Doc

Nvidia's CUDA programming guide is excellent, albeit obsolescent in places. The Pascal info looks like it's been tacked onto an older document.

The whitepaper NVIDIA GeForce GTX 1080 describes, from a gaming point of view, the GP104 GPU, which is in the GTX 1080, the card in parallel.ecse.

NVIDIA now has a higher level GPU, the P100, described in the P100 whitepaper and P100 technical overview. Note that the P100 is a Tesla (scientific computing) not a GeForce (gaming). This description is much more technical.

## 3   Managed memory issues

I'm sometimes seeing a 2.5x slowdown using managed memory on the host, compared to using unmanaged memory. I don't know why yet.

## 4   Stanford lectures

Last time, in lecture 4 we saw how to

1. cache data into shared memory for speed, and
2. use hierarchical sync.

Lecture 5 performance considerations shows how to fine tune your program once it's already working, if you need the extra speed.

Lecture 6 parallel patterns 1 presents some paradigms of parallel programming. These are generally useful building blocks for parallel algorithms.

# PAR Homework 3

Hand in your solution on RPILMS, unless instructions say otherwise. Each team should submit its solution under only 1 student's name. The other student's submission should just name the lead student. (This makes it easier for us to avoid grading it twice.)

If you have problems, then ask for help. The goal is to learn the material.

## Homework 3, due Fri 2017-02-24, 9am.

### Paper questions

1. Research and then describe the main changes from NVidia Maxwell to Pascal.
2. Although a thread can use 255 registers, that might be bad for performance. Why?
3. Give a common way that the various threads in a block can share data with each other.
4. Reading a word from global memory might take 400 cycles. Does that mean that a thread that reads many words from global memory will always take hundreds of times longer to complete?
5. Since the threads in a warp are executed in a SIMD fashion, how can an if-then-else block be executed?
6. What is unified virtual addressing and how does it make CUDA programming easier?

### Programming questions

1. Repeat homework 2's matrix multiplication problem, this time in CUDA. Report how much parallel speedup you get.

2. Look at the dataset /parallel-class/data/bunny. It contains 35947 points for the Stanford bunny.

Assuming that each point has a mass of 1, and is gravitationally attracted to the others, compute the potential energy of the system. The formula is this:

$U = - \sum_{i=1}^{N-1} \sum_{j=i+1}^N \frac{1}{r_{ij}}$

where $r_{ij}$ is the distance between points $i$ and $j$ . (This assumes that G=1).

3. Now look at the dataset /parallel-class/data/blade, which contains 882954 points for a turbine blade. Can you process it?

# PAR Lecture 9, Thurs Feb 16

## 1   Homeworks

Homework 3 is online.

## 2   CUDA

Last time we saw 2 ways to create managed variables. They can be accessed by either the host or the device and are paged automatically. This makes programming much easier.

1. Create static variables with __device__ __managed__. See /parallel-class/stanford/tutorials/vector_addition2.cu on parallel.
2. Use cudaMallocManaged. See /parallel-class/stanford/tutorials/vector_addition3.cu on parallel.
3. In either case, you need to call cudaDeviceSynchronize(); on the host after starting a parallel kernel before reading the data on the host. The reason is that the kernel is started asynchronously and control returns while it is still executing.
4. When the linux kernel gets HMM (heterogeneous memory management), all data on the heap will automatically be managed.
5. The reason is that virtual addresses are long enough to contain a tag saying what device they are on. The VM page mapper will read and write pages to various devices, not just swap files.
6. Any CUDA example using cudaMemcpy is now obsolete (on Pascal GPUs).

# PAR Lecture 8, Mon Feb 13

## 1   My web site

Moved to my own virtual server (Thanks).

This improves security.

http: redirects to https: .

You should see no difference; please report any problems.

I'm creating most new pages with the static site generator Nikola.

1. The big reason is that the pmwiki project is not as dynamic as it was when I picked it. Its cookbooks, which I use a lot, are not being updated as php is updated.
2. Also, static site generators, which did not exist when I picked pmwiki, are a cool idea.
3. Finally, nikola is more powerful for most purposes and looks better.
4. Nevertheless, I'll continue to use pmwiki for collaborative blogs, which nikola doesn't do. (My students report results to me using pmwiki blogs. They are an online version of lab notebooks. I highly recommend this.)

## 2   CUDA ideas

1. With CUDA, the dominant problem in program optimization is optimizing the data flow. Getting the data quickly to the cores is harder than processing it. It helps a lot to have regular arrays, where each core reads or writes a successive entry.

This is analogous to the hardware fact that wires are bigger (hence, more expensive) than gates.

2. That is the opposite optimization to OpenMP, where having different threads writing to adjacent addresses will cause the false sharing problem.

## 3   CUDA info on the web

1. https://developer.nvidia.com/cuda-faq
1. has links to other Nvidia docs.
2. is a little old. Kepler and Fermi are 2 and 3 generations old.

## 4   CUDA programs

1. parallel.ecse.rpi.edu now has CUDA 8.

2. The demo programs are in /local/cuda/samples/ . Their coding style is suboptimal. However, in /local/cuda/samples/1_Utilities/ , bandwidthTest and deviceQuery are interesting.

For your convenience, /parallel-class/deviceQuery is a link. Run it to see the GPU's capabilities.

3. I will try to use parallel for CUDA because it has a Pascal GPU; geoxeon has Maxwell and Kepler GPUs.

4. The program nvidia-smi shows the current load on the GPU.

5. My web copy of the tutorial programs from Stanford's parallel course notes is also on parallel at /parallel-class/stanford/tutorials/ .

1. I've edited some of them, and put the originals in orig/ , and created new ones.

2. To compile them, you need /local/cuda/bin in your PATH and /local/cuda/lib64 in your LD_LIBRARY_PATH .

3. Name your source program foo.cu for some foo .

4. Compile it thus: nvcc foo.cu -o foo .

5. hello_world.cu shows a simple CUDA program and uses a hack to print from a device function.

6. hello_world2.cu shows printing from several threads.

7. global_functions.cu shows some basic CUDA stuff.

8. device_functions.cu extends it.

9. vector_addition.cu does (you figure it out).

10. vector_addition2.cu is my modification to use unified memory, per http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html . I also cleaned up the code and shrank the number of lines for better display.

IMO, unified memory makes programming a lot easier.

Notes:

1. In linux, what's the easiest way to find the smallest prime larger than a given number?

2. To find the number of blocks needed for N threads, you can do it the Stanford way:

grid_size = num_elements / block_size; if(num_elements % block_size) ++grid_size;

or you can do it the RPI (i.e., my) way:

grid_size = (num_elements + block_size - 1) / block_size;

## 5   Nvidia conceptual hierarchy

As always, this is as I understand it, and could be wrong. Nvidia uses their own terminology inconsistently. They may use one name for two things (E.g., Tesla and GPU), and may use two names for one thing (e.g., module and accelerator). As time progresses, they change their terminology.

1. At the bottom is the hardware micro-architecture. This is an API that defines things like the available operations. The last several Nvidia micro-architecture generations are, in order, Tesla (which introduced unified shaders), Fermi, Kepler, Maxwell (introduced in 2014), Pascal (2016), and Volta (future).
2. Each micro-architecture is implemented in several different microprocessors. E.g., the Kepler micro-architecture is embodied in the GK107, GK110, etc. Pascal is GP104 etc. The second letter describes the micro-architecture. Different microprocessors with the same micro-architecture may have different amounts of various resources, like the number of processors and clock rate.
3. To be used, microprocessors are embedded in graphics cards, aka modules or accelerators, which are grouped into series such as GeForce, Quadro, etc. Confusingly, there is a Tesla computing module that may use any of the Tesla, Fermi, or Kepler micro-architectures. Two different modules using the same microprocessor may have different amounts of memory and other resources. These are the components that you buy and insert into a computer. A typical name is GeForce GTX1080.
4. There are many slightly different accelerators with the same architecture, but different clock speeds and memory, e.g. 1080, 1070, 1060, ...
5. The same accelerator may be manufactured by different vendors, as well as by Nvidia. These different versions may have slightly different parameters. Nvidia's reference version may be relatively low performance.
6. The term GPU sometimes refers to the microprocessor and sometimes to the module.
7. There are four families of modules: GeForce for gamers, Quadro for professionals, Tesla for computation, and Tegra for mobility.
8. Nvidia uses the term Tesla in two unrelated ways. It is an obsolete architecture generation and a module family.
9. Geoxeon has a (Maxwell) GeForce GTX Titan and a (Kepler) Tesla K20xm. Parallel has a (Pascal) GeForce GTX 1080. We also have an unused (Kepler) Quadro K5000.
10. Since the highest-end (Tesla) modules don't have video out, they are also called something like compute modules.

## 6   CUDA versions

1. CUDA has a capability version, whose major number corresponds to the micro-architecture generation. Kepler is 3.x. The K20xm is 3.5. The GTX 1080 is 6.1. Here is a table of the properties of different compute capabilities. However, that table is not completely consistent with what deviceQuery shows, e.g., the shared memory size.

2. nvcc, the CUDA compiler, can be told which capabilities (aka architectures) to compile for. They can be given as a real architecture, e.g., sm_61, or a virtual architecture, e.g., compute_61.

Just use the option -arch=compute_61.

3. The CUDA driver and runtime also have a software version, defining things like available C++ functions. The latest is 8.0. This is unrelated to the capability version.

## 7   GPU range of speeds

Here is an example of the wide range of Nvidia GPU speeds; all times are ±20%.

The GTX 1080 has 2560 CUDA cores @ 1.73GHz and 8GB of memory. matrixMulCUBLAS runs at 3136 GFlops. However, the reported time (0.063 msec) is so small that it may be inaccurate. The quoted speed of the 1080 is about triple that. I'm impressed that the measured performance is so close.

The Quadro K2100M in my Lenovo W540 laptop has 576 CUDA cores @ 0.67 GHz and 2GB of memory. matrixMulCUBLAS runs at 320 GFlops. The time on the GPU was about 0.7 msec, and on the CPU 600 msec.

It's nice that the performance almost scaled with the number of cores and clock speed.


# PAR Lecture 7, Thu Feb 9

Canceled because RPI closed because of blizzard.

# PAR Lecture 6, Mon Feb 6

## 1   Continuously updating geoxeon and parallel

To get new features working, I'm continuously upgrading things.

Sometimes that causes outages, sorry.

/afs is not working now. However, I don't think anyone uses it.

## 2   CUDA

From last Thurs, I've been lecturing from Stanford's parallel course notes, which they've made freely available. (Thanks.)

They are very well done.

They are obsolete in some places, which I'll mention.

However, a lot of newer CUDA material is also obsolete.

# PAR Homework 2

Hand in your solution on RPILMS, unless instructions say otherwise. Each team should submit its solution under only 1 student's name. The other student's submission should just name the lead student. (This makes it easier for us to avoid grading it twice.)

If you have problems, then ask for help. The goal is to learn the material.

## Homework 2, due Fri 2017-02-03, 9am.

1. Write a program to multiply two 100x100 matrices. Do it the conventional way, not using anything fancy like Strassen's algorithm. Now, see how much improvement you can get with OpenMP. Measure only the elapsed time for the multiplication, not for the matrix initialization.

Report these execution times, on geoxeon and parallel.

1. W/o openmp enabled (Don't use -fopenmp. Comment out the pragmas.)
2. With openmp, using only 1 thread.
3. Using 2, 4, 8, 16, 32, 64 threads.
2. Write programs to test the effect of the reduction pragma:

1. Create an array of 1,000,000,000 floats and fill it with pseudorandom numbers from 0 to 1.
2. Do the following tests with 1, 2, 4, 8, 16, and 32 threads.

Programs to write and test:

1. Sum it with a simple for loop. This will give a wrong answer with more than 1 thread, but is fast.
2. Sum it with the subtotal variable protected with an atomic pragma.
3. Sum it with the subtotal variable protected with a critical pragma.
4. Sum it with a reduction loop.

3. Sometimes parallelizing a program can increase its elapsed time. Try to create such an example, with 2 threads being slower than 1.

# PAR Lecture 5, Thu Feb 2

## 1   Compiling and running

1. c++ compilers:

1. geoxeon has 5.4 by default, and 6.2 if you use g++-6

2. parallel has 4.8.5 by default, and 5.3.1 in /opt/devtoolset-4/root/bin/

The problem is that parallel runs CentOS which makes it very hard to install recent versions of SW. Installing gcc-6 might require compiling the source.

2. Measuring program times when many people are on the machine:

1. This is a problem.
2. One partial solution is to use batch to queue your job to run when the system is unloaded.

## 2   Homework 2

Homework 2 is online. I'll leave time at the end of this and the next class for you to start and ask questions.

## 3   More OpenMP

1. Another nice, albeit obsolescent and incomplete, ref: Wikibooks OpenMP.

2. We'll look at its tasks examples.

1. While inside a pragma parallel, you queue up lots of tasks - they're executed as threads are available.

2. My example is tasks.cc with some variants.

3. taskfib.cc is modified from an example in the OpenMP spec.

2. This is only an example; you would never implement Fibonacci that way. (The fastest way is the following closed form. In spite of the square roots, the expression always gives an integer.)

$F_n = \frac{\left( \frac{1+\sqrt{5}}{2} \right) ^n - \left( \frac{1-\sqrt{5}}{2} \right) ^n }{\sqrt{5}}$

3. Spawning a task is expensive; we can calculate the cost.

4. stacksize.cc - get and set stack size. Can also do ulimit -s.

5. omps.cc - read assorted OMP variables.

6. Since I/O is not thread-safe, in hello_single_barrier.cc it has to be protected by a critical and also followed by a barrier.

7. OpenMP sections - my example is sections.cc with some variants.

1. Note my cool way to print an expression's name followed by its value.
2. Note the 3 required levels of pragmas: parallel, sections, section.
3. The assignment of sections to threads is nondeterministic.
4. IMNSHO, OpenMP is considerably easier than pthreads, fork, etc.

9. What you should have learned from the OpenMP lectures:

1. How to use a well-designed and widely used API that is useful on almost all current CPUs.
2. Shared memory: the most common parallel architecture.
3. How to structure a program to be parallelizable.
4. Issues in parallel programs, like nondeterminism, race conditions, critical regions, etc.

## 4   ssh, afs, zfs

1. I recommend that you create key pairs to connect to geoxeon and parallel w/o typing your password each time.

2. To avoid retyping your ssh private key passphrase in the same shell, do this

3. One advantage of ssh is that you can mount your geoxeon directory on your local machine. On my laptop, I run nemo, then do File - Connect to server, choose type ssh, server geoxeon.ecse.rpi.edu, use my RCSID as the user name, leave the password blank, and give my private key passphrase if asked. Any other program to mount network shares would work as well.

Where the share is actually mounted varies. On my laptop, it's here: /var/run/user/1000/gvfs/sftp:host=geoxeon.ecse.rpi.edu,user=wrf

As with any network mount, fancy filesystem things like attributes, simultaneous reading and writing from different machines, etc., probably won't work.

4. With ssh, you can also copy files back and forth w/o mounting or typing passwords:

1. scp localfile geoxeon.ecse.rpi.edu:
2. scp -r localdir geoxeon.ecse.rpi.edu:
3. scp geoxeon.ecse.rpi.edu:remotefile .

It even does filename completion on remote files.

5. You can also run single commands:

ssh geoxeon.ecse.rpi.edu hostname

6. Geoxeon implements AFS, so your RCS files are accessible (read and write) at

/afs/rpi.edu/home/??/RCSID.

Search for your home dir thus:

ls -ld /afs/rpi.edu/home/*/RCSID*

Authenticate yourself with klog.

7. On geoxeon, your files are transparently compressed, so there's no need to explicitly use gzip etc.

However, files on parallel are not transparently compressed.

## 5   Types of memory allocation

Here's a brief overview of my understanding of the various places that you can assign memory in a program.

1. Static. Define a fixed-size global array. The variable is constructed at compile time, so accesses might perhaps be faster. Global vars with non-default initial values increase the executable file size. If they're large enough, you need to use the compiler option -mcmodel=medium or -mcmodel=large. I don't know the effect on the program's size or speed.
2. Stack. Define local arrays that are created and freed as the routine is entered and exited. Their addresses relative to the base of this call frame may be constant. The default stack size is 8MB. You can increase this with the command ulimit or in the program as shown in stacksize.cc. I believe that in OpenMP, the max stack size may be allocated when each thread is created. Then, a really big stack size might have a penalty.
3. Heap. You use new and delete. Variables are constructed whenever you want. The more objects on the heap, the more time that each new or delete takes. If you have lots of objects, consider using placement new or creating an array of them.

I like to use static, then stack, and heap only when necessary.

Google's allocator is noticeably better than the default one. To use it, link your programs with -ltcmalloc. You can often use it on an existing executable foo by preloading the library with LD_PRELOAD.

I found it to save 15% to 30% in time.

Another memory concern is speed. Geoxeon has a NUMA (Non Uniform Memory Architecture). It has two 8-core Xeons. Each Xeon has 64GB of main memory. Although all 128GB are in a common address space, accessing the memory attached to the socket that the thread is running on is faster.

The following is what I think based on some research, but may be wrong: A 4KB page of memory is assigned to a specific core when it is first written (not when it is reserved). So, each page of a large array may be on a different core. This can be used to optimize things. This gets more fun with 8-processor systems.

All that is separate from cache issues.

You can also assign your OpenMP threads to specific cores. This affects speed in ways I don't understand. The issues are resource sharing vs conflicts.

## 6   NVidia device architecture, start of CUDA

1. The above shared memory model hits a wall; CUDA handles the other side of the wall.

# PAR Lecture 4, Mon Jan 30

1. Today is the last day to switch to the 6000 version of this course.

2. Accounts have been set up on parallel.ecse.rpi.edu .

3. parallel is accessible from on campus; from off campus use the VPN.

4. I copied over all your passwords and home directories from geoxeon.

5. I copied over geoxeon:/home/parcomp/Class including the yellowcab file

6. On both machines, I created a link /parallel-class to point to /home/parcomp/Class

1. That will get you specific material for this class.
2. However you're welcome to roam around on geoxeon to look at stuff from last time and to look at other yellowcab files etc.
3. My initial security policy for this class is to install reasonable security but then trust you not to abuse it.
4. Should you find something that is readable but probably should be protected, please tell me.
7. However, they are two separate computers that are not synchronized.

1. Changing your password on one machine won't change it on the other.
2. The HW is different.
3. The OS is different: Ubuntu on geoxeon, and CentOS on parallel.
4. Versions of SW may be different.
5. CentOS is a more production-oriented OS, and so by default installs older versions.
6. I'm working to put current SW on parallel, but it takes time since I'm the person maintaining it.
8. I recommend using parallel whenever you can.

9. I may put new stuff only on parallel.

10. However, currently only geoxeon has CUDA fully installed.

11. We saw that running even a short OpenMP program could be unpredictable. The reason for that was a big lesson.

This happened during the first space shuttle launch in 1981. A 1-in-70 chance event prevented the computers from synchronizing. This had been observed once before during testing. However, when they couldn't get it to happen again, they ignored it.

This interleaving of different threads can happen at the machine instruction level. The C statement K=K+3 can translate into three machine instructions: load K into a register, add 3 to the register, store the register back into K.

Let's color the threads red and blue. If the threads run one after the other (red loads, adds, and stores, then blue does the same), then K increases by 6. If they interleave thus: red loads, blue loads, red adds and stores, blue adds and stores,

then K increases by only 3, because blue's store overwrites red's.

12. This can be really hard to debug, particularly when you don't know where this is happening.

One solution is to serialize the offending code, so that only one thread executes it at a time. The limit would be to serialize the whole program and not to use parallel processing.

13. OpenMP has several mechanisms to help.

14. A critical pragma serializes the following block. There are two considerations.

1. You lose the benefits of parallelization on that block.
2. There is an overhead to set up a critical block that I estimate might be 100,000 instructions.
15. An atomic pragma serializes one short C statement, which must be one of a small set of allowed operations, such as increment. It matches certain atomic machine instructions, like test-and-set. The overhead is much smaller.

16. If every thread is summing into a common total variable, the reduction pragma causes each thread to sum into a private subtotal, and then sum the subtotals. This is very fast.

17. Another lesson is that sometimes you can check your program's correctness with an independent computation. For the trivial example of summing $i^2$, use the formula

$\sum_{i=1}^N i^2 = N(N+1)(2N+1)/6$.

There is a lot of useful algebra if you have the time to learn it. I, ahem, learned this formula in high school.

18. Another lesson is that even when the program gives the same answer every time, it might still be consistently wrong.

19. Another is that just enabling OpenMP facilities, like -fopenmp, slows your program down even if you don't use them.

20. Another is that the only meaningful time metric is elapsed real time. One reason that CPU time is meaningless is that OpenMP sometimes pauses a thread's progress by making it spin in a CPU-bound loop. (That is also a common technique in HW.)

21. Also note that CPU times can vary considerably with successive executions.

22. Also, using too many threads might increase the real time.

23. Finally, floating point numbers have their own problems. They are an approximation of the mathematical concept called the real number field. That is defined by 11 axioms that state obvious properties, like

A+B=B+A (commutativity) and

A+(B+C)=(A+B)+C (associativity).

This is covered in courses on modern algebra or abstract algebra.

The problem is that most of these are violated, at least somewhat, with floats. E.g.,

$\left(10^{20}-10^{20}\right)+1=0+1=1$ but

$10^{20}+\left(-10^{20}+1\right)=10^{20}-10^{20}=0$.

Therefore when threads execute in a different order, floating results might be different.

There is no perfect solution, though using double is a start.

1. On modern CPUs, double is just as fast as float.

2. However it's slower on GPUs.

How much slower can vary. Nvidia's Maxwell line spent very little real estate on double precision, so it was very slow. In contrast, Maxwell added a half-precision float, for implementing neural nets. Apparently neural nets use very few significant digits.

Nvidia's newer Pascal line reversed this design choice and spends more real estate on implementing double. parallel.ecse's GPU is Pascal.

3. It also requires moving twice the data, and data movement is often more important than CPU time.

The large field of numerical analysis is devoted to finding solutions, with more robust and stable algorithms.

Summing an array by first sorting, and then summing the absolutely smaller elements first is one technique. Inverting a matrix by pivoting on the absolutely largest element, instead of on $a_{11}$ is another.

24. Next class: start CUDA.

# PAR Lecture 3, Thurs Jan 26

1. ECSE-6967 Parallel Computing for Engineers, Spring 2017 (CRN 39429) is now online, if you wish to switch to it.
2. The requirement for this course will be everything for ECSE-4940 plus an extra more theoretical paper, in the style of a conference paper. It should be about six to eight typeset pages.
3. More OpenMP

1. OpenMP

# PAR Lecture 1, Thurs Jan 19

2. Read https://computing.llnl.gov/tutorials/parallel_comp/ for an intro to parallel computing.
3. Some points:
1. Physics limits to processor speed.
2. History of Nvidia.
3. Intel CPUs vs Nvidia CUDA cores.
5. OpenMP vs CUDA.
6. Rate-limiting cost usually I/O not computation.
4. Homework 1 is online, due Jan 26.

# PAR Homework 1

Hand in your solution on RPILMS, unless instructions say otherwise. Each team should submit its solution under only 1 student's name. The other student's submission should just name the lead student. (This makes it easier for us to avoid grading it twice.)

If you have problems, then ask for help. The goal is to learn the material.

## Homework 1, due Thurs 2017-01-26, 9am.

1. Some students prefer to use non-RPI email addresses. Some prefer names that are different from RPI's records. If either is true, then give details.

2. Do you expect to use your own computer? If so, run these commands and report the output:

lspci|grep -i nvidia
nvidia-smi
g++ -v
nvcc -V

For this question only, email the info to WRF.

3. Read https://computing.llnl.gov/tutorials/parallel_comp/ for an intro to parallel computing. From it, or from your own research, answer the following questions.

1. Why have machine cycle speeds stopped increasing?
2. What architectures do the top 3 machines on the Top 500 list use?
3. Which one of the following 4 choices describes most GPUs: SISD, SIMD, MISD, MIMD?
4. Which one of the following 4 choices describes most current multicore CPUs: SISD, SIMD, MISD, MIMD?
5. Per Amdahl's law, if a program is 10% sequential and 90% parallelizable, what is the max speed up that can be obtained with an infinite number of parallel processors?

5. Look at the file on parallel called /home/parallel/whole2015.txt.gz . It is a CSV file listing stats for all yellowcab taxi rides in NYC in 2015. Use whatever legal tools you prefer to exclude obvious errors and then find the most common hour of the day for routes that are longer than the average. Describe your method and include code listings.

# PAR Dates

Thu 2017-01-19: 1st class
Thu 2017-01-26: Homework 1 due
Homework 2 due
Homework 3 due
Homework 4 due
student quick talks on GTC
student quick talks on GTC
Homework 5 due
student quick talks on term project proposals

# PAR Syllabus

Titles:

ECSE-4740-01 Applied Parallel Computing for Engineers, CRN 39207

ECSE-6967-01 Parallel Computing for Engineers, CRN 39429

Spring term annually. 3 credit hours. Mon and Thurs 4-5:20, JEC 4104.

## 2   History

The 4000-level version of this course has been run twice before as the experimental course ECSE-4965-01. The experiment was deemed a success, so this is now a permanent course listed in the catalog as ECSE-4740-01.

There is also now an affiliated 6000-level topics course. The lectures are in common, but this course requires more work.

## 3   Description

1. This is intended to be a computer engineering course to provide students with knowledge and hands-on experience in developing applications software for affordable parallel processors. This course will cover hardware that any lab can afford to purchase. It will cover the software that, in the prof's opinion, is the most useful. There will also be some theory.
2. The target audiences are ECSE seniors and grads and others with comparable background who wish to develop parallel software.
3. This course will have minimal overlap with parallel courses in Computer Science. We will not teach the IBM BlueGene, because it is so expensive, nor cloud computing and MPI, because most big data problems are in fact small enough to fit on our hardware.
4. You may usefully take all the parallel courses at RPI.
5. The unique features of this course are as follows:
1. Use of only affordable hardware that any lab might purchase, such as Nvidia GPUs. This is currently the most widely used and least expensive parallel platform.
2. Emphasis on learning several programming packages, at the expense of theory. However you will learn a lot about parallel architecture.
6. Hardware taught, with reasons:
Multicore Intel Xeon: universally available and inexpensive, comparatively easy to program, powerful.
Intel Xeon Phi: affordable, potentially powerful, somewhat harder to program.
Nvidia GPU accelerator: widely available (Nvidia external graphics processors are on 1/3 of all PCs), very inexpensive, powerful, but harder to program. Good cards cost only a few hundred dollars.
7. Software that might be taught, with reasons:
OpenMP C++ extension: widely used, easy to use if your algorithm is parallelizable; backend is multicore Xeon.
Thrust C++ functional programming library: FP is nice, hides low-level details; backend can be any major parallel platform.
MATLAB: easy-to-use parallelism for operations that Mathworks has implemented in parallel.
Mathematica: interesting, powerful front end.
CUDA C++ extension and library for Nvidia GPUs.
8. The techniques learned here will also be applicable to larger parallel machines -- number 3 on the top 500 list uses NVIDIA GPUs, while number 2 uses Intel Xeon Phis. (Number 4 is a BlueGene.)
9. Effectively programming these processors will require in-depth knowledge about parallel programming principles, as well as the parallelism models, communication models, and resource limitations of these processors.

## 4   Prerequisite

ECSE-2660 CANOS or equivalent, knowledge of C++.

## 5   Why (Not) To Take This Course

Other courses might be a better fit if:

1. You don't like programming.
2. You don't like documenting your programs.
3. You don't like math.
5. You don't like writing exams at the official scheduled times. The final exam may be as late as Dec 20.

OTOH, here are some reasons that you might prefer to take a course from me.

1. I teach stuff that's fun and useful.
2. I acknowledge that you are simultaneously taking several courses, and so try to make the workload fair. E.g., if you're taking 6 3-credit courses, then you should not be required to spend more than $\frac{168}{6}$ hours per week per course :-).
3. I try to base exam questions more on important topics that occupied a lot of class time, and which are described in writing, often on this wiki.
4. I keep the course up-to-date and relevant.

## 6   Instructors

### 6.1   Professor

W. Randolph Franklin. BSc (Toronto), AM, PhD (Harvard)

### 6.2   Teaching assistant

1. Yin Li, liyN@rpi.edu, replacing N with 2*17 (5 hrs/wk)
2. Office hours:
   1. ECSE Flip Flop lounge in JEC 6037.
   2. Fri 4pm.
   3. Come near the start of the time; if no one is there, he may leave.
   4. He will try to stay as long as there are students asking questions, but will leave after 15 minutes if no one has arrived.
   5. If you need more time, or a different time, then write him, and he will try to accommodate you.
   6. He also attends most lectures.

## 7   Course websites

The homepage has lecture summaries, syllabus, homeworks, etc.

There is a separate page for important dates.

## 8   Reading material

### 8.1   Text

There is no required text, but the following inexpensive books may be used.

1. Sanders and Kandrot, CUDA by example. It gets excellent reviews, although it is several years old. Amazon has many options, including Kindle and renting hardcopies.
2. Kirk and Hwu, Programming massively parallel processors, 2nd edition. It concentrates on CUDA.

One problem is that even recent books may be obsolete. For instance they may ignore the recent CUDA unified memory model, which simplifies CUDA programming at a performance cost. Even if the current edition of a book was published after unified memory was released, the author might not have updated the examples.

### 8.2   Web

There is a lot of free material on the web, which I'll reference, and may cache locally.

## 9   Computer systems used

This course will use (remotely via ssh) parallel.ecse.rpi.edu. It has:

1. a dual 14-core Intel Xeon E5-2660 2.0GHz
2. 256GB of DDR4-2133 ECC Reg memory
3. Nvidia GeForce GTX 1080 processor with 8GB
4. Intel Xeon Phi 7120A
5. Samsung Pro 850 1TB SSD
6. WD Red 6TB 6Gb/s hard drive
7. CentOS
8. CUDA
9. OpenMP 4.0
10. Thrust

You may also use other compatible computers.

## 10   LMS

RPI LMS (formerly WebCT) will be used only for you to submit homeworks and for me to distribute grades.

Announcements and the homeworks themselves will be available on this website. You do not have to log in to see them.

## 11   Class times & places

Lectures are Mon & Thurs, 4-6pm, in JEC4107.

## 12   Lecture summaries and announcements

will be posted here.

## 13   Assessment measures, i.e., grades

1. There will be no exams.
2. The grade will be based on homeworks, a term project, class presentations, and possible iclicker questions TBD.
3. Deliverables for the term project:
   1. A 2-minute project proposal given to the class around the middle of the semester.
   2. A 5-minute project presentation given to the class in the last week.
   3. Some progress reports.
   4. A write-up uploaded on the last class day. This will contain an academic paper, code, and perhaps a video or user manual.

### 13.1   Homeworks

There will be frequent homeworks. You are encouraged to do the homework in teams of 2, and submit one solution per team, on RPILMS, in any reasonable format. The other team member should submit only a note listing the team and saying who submitted the solution.

"Reasonable" means a format that I can read. A scan of neat handwriting is acceptable. I would type material with a wiki like pmwiki or blogging tool, sketch figures with xournal or draw them with inkscape, and do the math with mathjax. Your preferences are probably different.

### 13.2   Term project

1. For the latter part of the course, most of your homework time will be spent on a term project.
2. You are encouraged to do it in teams of up to 3 people. A team of 3 people would be expected to do twice as much work as 1 person.
3. You may combine this with work for another course, provided that both courses know about this and agree. I always agree.
4. If you are a grad student, you may combine this with your research, if your prof agrees, and you tell me.
5. You may build on existing work, either your own or others'. You have to say what's new, and have the right to use the other work. E.g., using any GPLed code or any code on my website is automatically allowable (because of my Creative Commons licence).
6. You will implement, demonstrate, and document something vaguely related to parallel computing.
7. You will give a 5-minute fast-forward PowerPoint talk in class. A fast-forward talk is a timed presentation where the slides advance automatically.
8. You may demo it to the TA if you wish.

#### 13.2.1   Size of term project

It's impossible to specify how many lines of code make a good term project. E.g., I take pride in writing code that can be simultaneously shorter, more robust, and faster than others'. See my 8-line program for testing whether a point is in a polygon: Pnpoly.

According to Big Blues, when Bill Gates was collaborating with IBM around 1980, he once rewrote a code fragment to be shorter. However, according to the IBM metric, number of lines of code produced, he had just caused that unit to officially do negative work.

#### 13.2.2   Deliverables

1. An implementation showing parallel computing.
2. An extended abstract or paper on your project, written up like a paper. You should follow the style guide for some major conference (I don't care which, but can point you to one).
3. A more detailed manual, showing how to use it.
4. A talk in class.

A 10-minute demonstration to the TA is optional. If you do one, he will give me a modifier of up to 10 points either way. I.e., a good demo will help, and a bad one will hurt.

### 13.3   Correcting the Prof's errors

Occasionally I make mistakes, either in class or on the web site. The first person to correct each nontrivial error will receive an extra point on his/her grade. One person may accumulate several such know-it-all points.

### 13.4   Iclickers

Iclicker questions will be posed in some classes. The questions are intended to be easy. Please bring your iclickers.

### 13.5   Missing or late work

1. We will drop the lowest homework grade. That will handle excused absences, unexcused absences, dying relatives, illnesses, team trips, and other problems.
2. If your term project is late, and you have an acceptable excuse for an incomplete, you will be given that, and the project will be graded next fall. Note that RPI has tightened the rules for incompletes; they are not automatic.

### 13.6   Grade distribution & verification

1. I'll post homework grading comments on LMS. Please report any errors, disagreements, or appeals by email within one week.
2. From time to time, I'll post your grades to LMS. Please report any missing grades within one week to the TA, with a copy to the prof.
3. It is not allowed to wait until the end of the semester, and then go back 4 months to try to find extra points. It is especially not allowed to wait until the end of the following semester, and then to ask what you may do to raise your grade.
4. I maintain standards (and the value of your diploma) by giving the grades that are earned, not the grades that are desired. Nevertheless, this course's average grade is competitive with other courses, and last year's students seemed to like the course.
5. Appeal first to the TA, then to the prof, then to any other prof in ECSE acting as a mediator (such as Prof Wozny, the curriculum chair), and then to the ECSE Head. It is preferable to state your objection in writing.

### 13.7   Mid-semester assessment

Before the drop date, I will email you your performance to date.

### 13.8   Early warning system (EWS)

As required by the Provost, we may post notes about you to EWS, for example, if you're having trouble doing homeworks on time, or miss an exam. E.g., if you tell me that you had to miss a class because of family problems, then I may forward that information to the Dean of Students office.

## 14   Academic integrity

See the Student Handbook for the general policy. The summary is that students and faculty have to trust each other. After you graduate, your most important possession will be your reputation.

Specifics for this course are as follows.

1. You may collaborate on homeworks, but each team of 1 or 2 people must write up the solution separately (one writeup per team) using their own words. We willingly give hints to anyone who asks.
2. The penalty for two teams handing in identical work is a zero for both.
3. You may collaborate in teams of up to 3 people for the term project.
4. You may get help from anyone for the term project. You may build on a previous project, either your own or someone else's. However you must describe and acknowledge any other work you use, and have the other person's permission, which may be implicit. E.g., my web site gives a blanket permission to use it for nonprofit research or teaching. You must add something creative to the previous work. You must write up the project on your own.
5. However, writing assistance from the Writing Center and similar sources is allowed, if you acknowledge it.
6. The penalty for plagiarism is a zero grade.
7. Cheating will be reported to the Dean of Students Office.

## 15   Student feedback

Since it's my desire to give you the best possible course in a topic I enjoy teaching, I welcome feedback during (and after) the semester. You may tell me, write to me or the TA, or contact a third party, such as Prof Gary Saulnier, the ECSE undergrad head, or Prof Mike Wozny, the ECSE Dept head.