PAR Class 27, Mon 2019-04-22

1   Final presentations

  1. Mon
    1. SP
    2. AJ
    3. ER
    4. ZJ
  2. Thu
    1. AG
    2. JG
    3. SM
    4. ZC
    5. HH
    6. ZL
    7. HB
    8. BM
    9. KS
    10. JC

2   Term project

  1. Details are in the syllabus, except that your talk is 10 minutes, not 15, and you're not required to demo it.
  2. You may base your paper on my 2018 Fall Workshop on Computational Geometry paper here. Or, you may use any other, equally formal, style.
  3. Please hand everything in by the end of the weekend. Putting everything on parallel might be easiest, in a dir named project. Then email me.

3   Course recap

  1. My teaching style is to work from particulars to the general.
  2. You've seen OpenMP, a tool for shared memory parallelism.
  3. You've seen the architecture of NVidia's GPU, a widely used parallel system, and CUDA, a tool for programming it.
  4. You've seen Thrust, a tool on top of CUDA, built in the C++ STL style.
  5. You've seen quantum computing, both the theory and programming IBM's quantum computer.
  6. You've seen how widely used numerical tools like BLAS and FFT have versions built on CUDA.
  7. You've had a chance to program with all of them on parallel.ecse, with its dual 14-core Xeons and Pascal-generation NVidia GPU.
  8. You've seen talks by leaders in high-performance computing, such as Jack Dongarra.
  9. You've seen quick references to parallel programming using Matlab, Mathematica, and the cloud.
  10. Now, you can inductively reason towards general design rules for shared and non-shared parallel computers, and for the SW tools to exploit them.

4   Course survey

If you liked the course, then please officially tell RPI by completing the survey. Thanks.

PAR Class 26, Thu 2019-04-18

1   Final presentations

Everyone has now signed up for next week, Mon or Thurs.

https://doodle.com/poll/kcsek2wi39s97uzm

Nevertheless, I'll entertain tales of woe about the date.

2   Inspiration for finishing your term projects

  1. The Underhanded C Contest

    "The goal of the contest is to write code that is as readable, clear, innocent and straightforward as possible, and yet it must fail to perform at its apparent function. To be more specific, it should do something subtly evil. Every year, we will propose a challenge to coders to solve a simple data processing problem, but with covert malicious behavior. Examples include miscounting votes, shaving money from financial transactions, or leaking information to an eavesdropper. The main goal, however, is to write source code that easily passes visual inspection by other programmers."

  2. The International Obfuscated C Code Contest

  3. https://www.awesomestories.com/asset/view/Space-Race-American-Rocket-Failures

    Moral: After early disasters, sometimes you can eventually get things to work.

  4. The 'Wrong' Brothers Aviation's Failures (1920s)

  5. Early U.S. rocket and space launch failures and explosion

  6. Numerous US Launch Failures

3   Intel Xeon Phi 7120A

3.1   Summary

  1. Intel's answer to Nvidia
  2. The start of Wikipedia might be interesting.
  3. Parallel.ecse has a Phi.
  4. However, because of lack of demand, I haven't installed the current drivers, so it's not usable.
  5. In July 2018, Intel ended this product:
    1. https://www.nextplatform.com/2018/07/27/end-of-the-line-for-xeon-phi-its-all-xeon-from-here/
    2. https://www.networkworld.com/article/3296004/intel-ends-the-xeon-phi-product-line.html
  6. Nevertheless, anything Intel does is worth a look.
  7. Historically, Intel has introduced a number of products that failed. From time to time, one of their products also succeeds. Perhaps they should not start to work on projects that are going to fail. :-)
  8. Intel also incorporated many of the Phi's features into the regular Xeon.

3.2   In general

  1. The Xeon Phi is Intel's brand name for their MIC (for Many Integrated Core Architecture).

  2. The 7120A is Intel's Knights Corner (1st-generation) MIC architecture, launched in 2014.

  3. It has 61 cores running 244 threads (4 per core) clocked at about 1.3GHz.

    Having several threads per core helps to hide the latency of fetching data from memory.

  4. It has 16GB of memory accessible at 352 GB/s, 30MB of L2 cache, and peaks at 1 TFlops double precision.

  5. It is a coprocessor on a card accessible from a host CPU on a local network.

  6. It is intended as a supercomputing competitor to Nvidia.

  7. The mic architecture is quite similar to the Xeon's.

  8. However, executables from one don't run on the other, unless the source was compiled to include both versions in the executable file.

  9. The mic has been tuned to emphasize floating-point performance at the expense of, e.g., speculative execution.

    This helps to make it competitive with Nvidia, even though Nvidia GPUs can have many more cores.

  10. Its OS is busybox, an embedded version of linux.

  11. The SW is called MPSS (Manycore Platform Software Stack).

  12. The mic can be integrated with the host in various ways that I haven't (yet) implemented.

    1. Processes on the host can execute subprocesses on the device, as happens with Nvidia CUDA.
    2. E.g., OpenMP on the host can run parallel threads on the mic.
    3. The mic can page virtual memory from the host.
  13. The fastest machine on top500.org a few years ago used Xeon Phi cards.

    The 2nd used Nvidia K20 cards, and the 3rd fastest was an IBM Blue Gene.

    So, my course lets you use the 2 fastest architectures, and there's another course available at RPI for the 3rd.

  14. Information:

    1. https://en.wikipedia.org/wiki/Xeon_Phi
    2. http://ark.intel.com/products/80555/Intel-Xeon-Phi-Coprocessor-7120A-16GB-1_238-GHz-61-core
    3. http://www.intel.com/content/www/us/en/products/processors/xeon-phi/xeon-phi-processors.html
    4. http://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html
    5. https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss
    6. https://pleiades.ucsc.edu/hyades/MIC_QuickStart_Guide

3.3   parallel.ecse's mic

  1. The hostname (of this particular MIC) is parallel-mic0 or mic0.

  2. The local filesystem is in RAM and is reinitialized when mic0 is rebooted.

  3. Parallel:/home and /parallel-class are NFS exported to mic0.

  4. /home can be used to move files back and forth.

  5. All the user accounts on parallel were given accounts on mic0.

  6. You can ssh to mic0 from parallel.

  7. Your current parallel ssh key pair should work.

  8. Your parallel login password as of a few days ago should work on mic0.

    However, future changes to your parallel password will not propagate to mic0 and you cannot change your mic0 password.

    (The mic0 setup snapshotted parallel's accounts and created a read-only image to boot mic0 from. Any changes to mic0:/etc/shadow are reverted when mic0 reboots.)

    So use your public key.

3.4   Programming the mic

  1. Parallel:/parallel-class/mic/bin has versions of gcc, g++, etc., with names like k1om-mpss-linux-g++.

  2. These run on parallel and produce executable files that run (only) on mic0.

  3. Here's an example of compiling (on parallel) a C program in /parallel-class/mic

    bin/k1om-mpss-linux-gcc hello.c -o hello-mic
    
  4. Run it thus from parallel (it runs on mic0):

    ssh mic0  /parallel-class/mic/hello-mic
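
For reference, here's the kind of minimal test program this workflow assumes; this is my sketch, not necessarily the actual hello.c in /parallel-class/mic. It just reports which host it runs on, so you can confirm it ran on the Phi:

  // hello.c -- minimal cross-compilation test (sketch)
  #include <stdio.h>
  #include <unistd.h>

  int main() {
    char host[256];
    gethostname(host, sizeof host);   // should report mic0 (or similar) when run on the Phi
    printf("hello from %s\n", host);
    return 0;
  }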
    

4   IBM Blue Gene

  1. IBM has also ended the Blue Gene product line.
  2. They're currently pushing an OpenPOWER POWER8 host plus Nvidia (or other) devices.
  3. E.g., it tightly couples a POWER8 to several Nvidia GPUs.

5   Cloud computing

(Enrichment; read on your own)

The material is from Wikipedia, which appeared better than any other sources that I could find.

  1. Hierarchy:

    1. IaaS (Infrastructure as a Service)
      1. Sample functionality: VM, storage
      2. Examples:
        1. Google_Compute_Engine
        2. Amazon_Web_Services
        3. OpenStack : compute, storage, networking, dashboard
    2. PaaS (Platform ...)
      1. Sample functionality: OS, Web server, database server
      2. Examples:
        1. OpenShift
        2. Cloud_Foundry
        3. Hadoop :
          1. distributed FS, Map Reduce
          2. derived from Google FS, map reduce
          3. used by Facebook etc.
      3. Now, people often run Apache Spark™ - Lightning-Fast Cluster Computing instead of Hadoop, because Spark is faster.
    3. SaaS (Software ...)
      1. Sample functionality: email, gaming, CRM, ERP
  2. Cloud_computing_comparison

  3. Virtual machine.

    The big question is, at what level does the virtualization occur? Do you duplicate the whole file system and OS, even emulate the HW, or just try to isolate files and processes within the same OS?

    1. Virtualization
    2. Hypervisor
    3. Xen
    4. Kernel-based_Virtual_Machine
    5. QEMU
    6. VMware
    7. Containers, docker,
    8. Comparison_of_platform_virtual_machines
  4. Distributed storage

    1. Virtual_file_system
    2. Lustre_(file_system)
    3. Comparison_of_distributed_file_systems
    4. Hadoop_distributed_file_system
  5. See also

    1. VNC
    2. Grid_computing
      1. decentralized, heterogeneous
      2. used for major projects like protein folding

6   More parallel tools

6.1   cuFFT Notes

  1. GPU Computing with CUDA Lecture 8 - CUDA Libraries - CUFFT, PyCUDA from Christopher Cooper, BU
  2. video #8 - CUDA 5.5 cuFFT FFTW API Support. 3 min.
  3. cuFFT is inspired by FFTW (the Fastest Fourier Transform in the West), whose authors say it is as fast as commercial FFT packages.
  4. I.e., sometimes commercial packages may be worth the money.
  5. Although the FFT is taught for N a power of two, users often want to process other dataset sizes.
  6. The problem is that the optimal recursion method, and the relevant coefficients, depend on the prime factors of N.
  7. FFTW and cuFFT determine a good solution procedure for the particular N.
  8. Since this computation takes time, they store the method in a plan.
  9. You can then apply the plan to many datasets.
  10. If you're going to be processing very many datasets, you can tell FFTW or cuFFT to perform sample timing experiments on your system, to help in devising the best plan.
  11. That's a nice strategy that some other numerical SW uses.
  12. One example is Automatically Tuned Linear Algebra Software (ATLAS).
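
A minimal sketch of the plan-then-execute pattern described above, assuming cuFFT is installed (the size and variable names are mine; link with -lcufft):

  #include <cufft.h>
  #include <cuda_runtime.h>

  int main() {
    const int N = 1000;                         // not a power of 2; cuFFT still builds a good plan
    cufftComplex *d;
    cudaMalloc(&d, N * sizeof(cufftComplex));   // fill d with your data here
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);        // the (possibly expensive) planning step
    cufftExecC2C(plan, d, d, CUFFT_FORWARD);    // apply the plan; reuse it on many datasets
    cufftDestroy(plan);
    cudaFree(d);
    return 0;
  }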

6.2   cuBLAS etc Notes

  1. BLAS is an API for a set of simple matrix and vector functions, such as multiplying a vector by a matrix.
  2. These functions' efficiency is important since they are the basis for widely used numerical applications.
  3. Indeed you usually don't call BLAS functions directly, but use higher-level packages like LAPACK that use BLAS.
  4. There are many implementations, free and commercial, of BLAS.
  5. cuBLAS is one.
  6. One reason that Fortran is still used is that, in the past, it was easier to write efficient Fortran programs than C or C++ programs for these applications.
  7. There are other, very efficient, C++ numerical packages. (I can list some, if there's interest).
  8. Their efficiency often comes from aggressively using C++ templates.
  9. SC12 Demo: Using CUDA Library to accelerate applications 7:11 https://www.youtube.com/watch?v=P2Ew4Ljyi6Y
  10. Matrix mult example
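
Here's a hedged sketch of item 10's idea, multiplying two square matrices with cuBLAS (sizes and names are mine; error checking omitted). cuBLAS follows the BLAS/Fortran convention of column-major storage:

  #include <cublas_v2.h>
  #include <cuda_runtime.h>

  int main() {
    const int n = 512;
    float *A, *B, *C;                            // n x n matrices, column-major
    cudaMallocManaged(&A, n*n*sizeof(float));
    cudaMallocManaged(&B, n*n*sizeof(float));
    cudaMallocManaged(&C, n*n*sizeof(float));
    // ... fill A and B here ...
    cublasHandle_t h;
    cublasCreate(&h);
    const float alpha = 1, beta = 0;
    // C = alpha*A*B + beta*C
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
    cudaDeviceSynchronize();
    cublasDestroy(h);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
  }

Compile with something like nvcc matmul.cu -lcublas.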

6.3   Matlab

  1. Good for applications that look like matrices.

    Considerable contortions required for, e.g., a general graph. You'd represent that with a large sparse adjacency matrix.

  2. Using explicit for loops is slow.

  3. Efficient execution when using builtin matrix functions,

    but it can be difficult to write your algorithm that way, and

    the resulting code can be hard to read.

  4. Very expensive and getting more so.

    Many separately priced apps.

  5. Uses state-of-the-art numerical algorithms.

    E.g., to solve large sparse overdetermined linear systems.

    Better than Mathematica.

  6. Most or all such algorithms also freely available as C++ libraries.

    However, which library to use?

    Complicated calling sequences.

    Obscure C++ template error messages.

  7. Graphical output is mediocre.

    Mathematica is better.

  8. Various ways Matlab can execute in parallel

    1. Operations on arrays can execute in parallel.

      E.g. B=SIN(A) where A is a matrix.

    2. Automatic multithreading by some functions

      Various functions, like INV(a), automatically use perhaps 8 cores.

      The '8' is a license limitation.

      Which MATLAB functions benefit from multithreaded computation?

    3. PARFOR

      Like FOR, but multithreaded.

      However, FOR is slow.

      Many restrictions, e.g., cannot be nested.

      Matlab's introduction to parallel solutions

      Start pools first with: MATLABPOOL OPEN 12

      Limited to 12 threads.

      Can do reductions.

    4. Parallel Computing Server

      This runs on a parallel machine, including Amazon EC2.

      Your client sends batch or interactive jobs to it.

      Many Matlab toolboxes are not licensed to use it.

      This makes it much less useful.

    5. GPU computing

      Create an array on device with gpuArray

      Run builtin functions on it.

      Matlab's run built in functions on a gpu

  9. Parallel and GPU Computing Tutorials, Part 9: GPU Computing with MATLAB 6:19 https://www.youtube.com/watch?v=1WTIHzwJ4j0

6.4   Mathematica in parallel

You terminate an input command with shift-enter.

Some Mathematica commands:

Sin[1.]
Plot[Sin[x],{x,-2,2}]
a=Import[
 "/opt/parallel/mathematica/mtn1.dat"]
Information[a]
Length[a]
b=ArrayReshape[a,{400,400}]
MatrixPlot[b]
ReliefPlot[b]
ReliefPlot[b,Method->"AspectBasedShading"]
ReliefPlot[MedianFilter[b,1]]
Dimensions[b]
Eigenvalues[b]   (* when you get bored waiting, type alt-. to abort *)
Eigenvalues[b+0.0]
Table[ {x^i y^j,x^j y^i},{i,2},{j,2}]
Flatten[Table[ {x^i y^j,x^j y^i},{i,2},{j,2}],1]
StreamPlot[{x*y,x+y},{x,-3,3},{y,-3,3}]
$ProcessorCount
$ProcessorType
(* select "Parallel Kernel Configuration" and "Status" in the Evaluation menu *)
ParallelEvaluate[$ProcessID]
PrimeQ[101]
Parallelize[Table[PrimeQ[n!+1],{n,400,500}]]
merQ[n_]:=PrimeQ[2^n-1]
Select[Range[5000],merQ]
ParallelSum[Sin[x+0.],{x,0,100000000}]
Parallelize[  Select[Range[5000],merQ]]
Needs["CUDALink`"]  (* note the back quote *)
CUDAInformation[]
Manipulate[n, {n, 1.1, 20.}]
Plot[Sin[x], {x, 1., 20.}]
Manipulate[Plot[Sin[x], {x, 1., n}], {n, 1.1, 20.}]
Integrate[Sin[x]^3, x]
Manipulate[Integrate[Sin[x]^n, x], {n, 0, 20}]
Manipulate[{n, FactorInteger[n]}, {n, 1, 100, 1}]
Manipulate[Plot[Sin[a x] + Sin[b x], {x, 0, 10}],
    {a, 1, 4}, {b, 1, 4}]

Unfortunately, there's a problem with the Mathematica-CUDA interface that I'm still debugging.

Applications of GPU Computation in Mathematica (42:18) https://www.youtube.com/watch?v=LZAZ4ddUMKU

PAR Class 25, Mon 2019-04-15

2   Software tips

2.1   Git

Git is good for keeping several versions of a project simultaneously. A git intro:

Create a dir for the project:

mkdir PROJECT; cd PROJECT

Initialize:

git init

Create a branch (you can do this several times):

git branch MYBRANCHNAME

Go to a branch:

git checkout MYBRANCHNAME

Do things:

vi, make, ....

Save it:

git add .; git commit -m "COMMENT"

Repeat

I might use this to modify a program for class.

2.2   Freeze decisions early: SW design paradigm

One of my rules is to push design decisions to take effect as early in the process execution as possible. Constructing variables at compile time is best, at function call time is 2nd, and on the heap is worst.

  1. If I have to construct variables on the heap, I construct few and large variables, never many small ones.

  2. Often I compile the max dataset size into the program, which permits constructing the arrays at compile time. Recompiling for a larger dataset is quick (unless you're using CUDA).

    Accessing this type of variable uses one less level of pointer than accessing a variable on the heap. I don't know whether this is faster with a good optimizing compiler, but it's probably not slower.

  3. If the data will require a dataset with unpredictably sized components, such as a ragged array, then I may do the following.

    1. Read the data once to accumulate the necessary statistics.
    2. Construct the required ragged array.
    3. Reread the data and populate the array.
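
Here is a small sketch of that two-pass ragged-array idea; the data format (a list of (group, value) pairs) and all names are hypothetical, and the "reread" is just a second loop over an in-memory list:

  #include <cstdio>
  #include <vector>
  #include <utility>

  int main() {
    const int ngroups = 4;
    std::vector<std::pair<int,double>> input = {{0,1.5},{2,2.5},{0,3.5},{3,4.5}};  // (group, value)
    std::vector<int> count(ngroups, 0);
    for (auto &p : input) count[p.first]++;                  // pass 1: accumulate the statistics
    std::vector<int> offset(ngroups+1, 0);
    for (int g = 0; g < ngroups; g++) offset[g+1] = offset[g] + count[g];
    std::vector<double> data(input.size());                  // one flat array, few large allocations
    std::vector<int> next(offset.begin(), offset.end()-1);
    for (auto &p : input) data[next[p.first]++] = p.second;  // pass 2: populate
    // group g's values are data[offset[g]] .. data[offset[g+1]-1]
    printf("group 0 has %d values\n", offset[1] - offset[0]);
    return 0;
  }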

3   CPPCON

CppCon is the annual, week-long face-to-face gathering for the entire C++ community. https://cppcon.org/

https://cppcon2018.sched.com/

CppCon 2014: Herb Sutter "Paying for Lunch: C++ in the ManyCore Age" https://www.youtube.com/watch?v=AfI_0GzLWQ8

CppCon 2016: Combine Lambdas and weak_ptrs to make concurrency easy (4min) https://www.youtube.com/watch?v=fEnnmpdZllQ

A Pragmatic Introduction to Multicore Synchronization by Samy Al Bahra. https://www.youtube.com/watch?v=LX4ugnzwggg

CppCon 2018: Tsung-Wei Huang “Fast Parallel Programming using Modern C++” https://www.youtube.com/watch?v=ho9bqIJkvkc&list=PLHTh1InhhwT7GoW7bjOEe2EjyOTkm6Mgd&index=13

CppCon 2018: Anny Gakhokidze “Workflow hacks for developers” https://www.youtube.com/watch?v=K4XxeB1Duyo&list=PLHTh1InhhwT7GoW7bjOEe2EjyOTkm6Mgd&index=33

CppCon 2018: Bjarne Stroustrup “Concepts: The Future of Generic Programming (the future is here)” https://www.youtube.com/watch?v=HddFGPTAmtU

CppCon 2018: Herb Sutter “Thoughts on a more powerful and simpler C++ (5 of N)” https://www.youtube.com/watch?v=80BZxujhY38

CppCon 2018: Jefferson Amstutz “Compute More in Less Time Using C++ Simd Wrapper Libraries” https://www.youtube.com/watch?v=80BZxujhY38

CppCon 2018: Geoffrey Romer “What do you mean "thread-safe"?” https://www.youtube.com/watch?v=s5PCh_FaMfM

PAR Class 24, Thu 2019-04-11

2   Others

Experiment with Basic Quantum Algorithms (Ali Javadi-Abhari, ISCA 2018) (19:05) https://www.youtube.com/watch?v=M1UHi9UXTWI&list=PLOFEBzvs-VvruANdBhTb-9YRDes07pZWQ&index=2

PAR Class 23, Mon 2019-04-08

1   Supercomputing 2018 (SC18) videos

SC18 Invited Talk: Matthias Troyer https://www.youtube.com/watch?v=97s3FEhG14g (45:46)

This is very nice. I'll show the start and then a highlight, starting at 19:00.

SC18: NVIDIA CEO Jensen Huang on the New HPC (1:44:46) https://www.youtube.com/watch?v=PQbhxfRH2H4

We'll watch the wrapup.

SCinet Must See: How to Build the World’s Fastest Temporary Computer Network Time Lapse (0:57) https://sc18.supercomputing.org/scinet-must-see-how-to-build-the-worlds-fastest-temporary-computer-network-time-lapse/

3   Jack Dongarra

He has visited RPI more than once.

Algorithms for future emerging technologies (2:55:45) https://www.youtube.com/watch?v=TCgHNMezmZ8

We'll watch the start. You're encouraged to watch it all. It gets more technical later.

Stony Brook talk https://www.youtube.com/watch?v=D1hfrtoVZDo

4   Shor's algorithm

Shor on, what is Shor's factoring algorithm? (2:09) https://www.youtube.com/watch?v=hOlOY7NyMfs

Hacking at Quantum Speed with Shor's Algorithm (16:35) https://kcts9.org/programs/digital-studios-infinite-series/episodes/0121

43 Quantum Mechanics - Quantum factoring Period finding (19:27) https://www.youtube.com/watch?v=crMM0tCboZU

44 Quantum Mechanics - Quantum factoring Shor's factoring algorithm (25:42) https://www.youtube.com/watch?v=YhjKWAMFBUU

PAR Class 22, Thu 2019-04-04

2   Quantum computing summary, another view

  1. The Hadamard transform rotates the initial pure state into a superposition of many states.
  2. Then we operate in parallel on each state in the superposition.
  3. Finally we separate out the desired answer.

PAR Class 21, Mon 2019-04-01

2   Quantum computing on youtube

  1. Quantum Algorithms (2:52)

    Which problems can quantum computers solve exponentially faster than classical computers? David Gosset, IBM quantum computing research scientist, explains why algorithms are key to finding out.

  2. David Deutsch - Why is the Quantum so Strange? (8:43)

  3. Grover's Algorithm (9:57)

    An overview of Grover's Algorithm. An unstructured search algorithm that can find an item in a list much faster than a classical computer can. Several sources are listed.

  4. Bob Sutor demonstrates the IBM Q quantum computer (6:53)

  5. Can we make quantum technology work? | Leo Kouwenhoven | TEDxAmsterdam (18:19)

  6. "Spooky" physics | Leo Kouwenhoven | TEDxDelft (18:00)

3   Quantum computing

ctd

  1. Algorithms:
    1. Some, but not all, are faster.
    2. Bounded-error quantum polynomial time (BQP)
      1. "is the class of decision problems solvable by a quantum computer in polynomial time, with an error probability of at most 1/3 for all instances" - https://en.wikipedia.org/wiki/BQP
      2. Includes integer factorization and discrete log.
      3. Relation to NP is unknown (big unsolved problem).
    3. Grover's algorithm:
      1. https://en.wikipedia.org/wiki/Grover%27s_algorithm
      2. Given a black box with N inputs and 1 output.
      3. Exactly one input makes the output 1.
      4. Problem: which one?
      5. Classical solution: Try each input, T=N.
      6. Quantum: \(T=\sqrt{N}\).
      7. Probabilistic.
      8. Apps: mean, median, reverse a crypto hash, find collisions, generate false blocks.
      9. Can extend to quantum partial search.
      10. Grover's algorithm is optimal.
      11. This suggests that NP is not in BQP.
    4. Shor's algorithm:
      1. Factorize an int.
      2. in BQP.
      3. almost exponentially faster than the best known classical algorithm.
      4. Largest examples I can find:
        1. 56153 = 233 × 241.
        2. https://medium.com/@aditya.yadav/rsa-2048-cracked-using-shors-algorithm-on-a-quantum-computer-660cb2297a95
  2. https://quantumexperience.ng.bluemix.net/qx/tutorial ctd.

PAR Class 20, Thu 2019-03-28

2   Quantum computing

  1. https://www.quantum-inspire.com/kbase/introduction-to-quantum-computing/

  2. More from

    Quantum Computing for Computer Scientists 1st Edition ( recommended in Microsoft talk)

    1. Compare byte with qbyte.

    2. State of byte is 8 bits.

    3. State of qbyte is 256 complex weights.

    4. They all get modified by each operation (aka matrix multiply).

    5. That is the power of quantum computing.

    6. The current limitations are that IBM does only a few qbits and that the operation is noisy.

    7. No cloning: Cannot copy a qbit, but can move it.

    8. Copied from page 171: All quantum algorithms work with the following basic framework:

      1. The system will start with the qubits in a particular classical state.
      2. From there the system is put into a superposition of many states.
      3. This is followed by acting on this superposition with several unitary operations.
      4. And finally, a measurement of the qubits.
    9. Ways to look at measurement:

      1. Converts qbit to classical bit.
      2. Is an interaction with the external world.
      3. Information is not lost, but leaks into the external world.
      4. Is an operator represented by a matrix.
      5. that is Hermitian, i.e., equal to its conjugate transpose.
      6. For physical systems, some operators compute the system's momentum, position, or energy.
      7. The matrix's eigenvalues are real.
      8. The result of the operation is one of the eigenvalues.
    10. https://en.m.wikipedia.org/wiki/Bloch_sphere:

      1. The state of a qbit can be represented as a point on or in a sphere of radius 1.
      2. E.g., |0> is the north pole, |1> the south pole.
      3. Many operations are rotations.
    11. Common operations (aka gates):

      https://en.wikipedia.org/wiki/Quantum_logic_gate

      1. Hadamard.

        Creates a superposition. (Its matrix is written out just after this list.)

        https://cs.stackexchange.com/questions/63859/intuition-behind-the-hadamard-gate

      2. Toffoli, aka CCNOT.

        Universal for classical boolean functions.

        (a,b,c) -> (a,b, c xor (a and b))

        https://en.wikipedia.org/wiki/Toffoli_gate

      3. Fredkin, aka CSWAP.

        https://en.wikipedia.org/wiki/Fredkin_gate

        3 inputs. Swaps last 2 if first is true.

        sample app: 5 gates make a 3-bit full adder.

  3. Algorithms:

    1. Some, but not all, are faster.
    2. Bounded-error quantum polynomial time (BQP)
      1. "is the class of decision problems solvable by a quantum computer in polynomial time, with an error probability of at most 1/3 for all instances" - https://en.wikipedia.org/wiki/BQP
      2. Includes integer factorization and discrete log.
      3. Relation to NP is unknown (big unsolved problem).
    3. Grover's algorithm:
      1. https://en.wikipedia.org/wiki/Grover%27s_algorithm
      2. Given a black box with N inputs and 1 output.
      3. Exactly one input makes the output 1.
      4. Problem: which one?
      5. Classical solution: Try each input, T=N.
      6. Quantum: \(T=\sqrt{N}\).
      7. Probabilistic.
      8. Apps: mean, median, reverse a crypto hash, find collisions, generate false blocks.
      9. Can extend to quantum partial search.
      10. Grover's algorithm is optimal.
      11. This suggests that NP is not in BQP.
    4. Shor's algorithm:
      1. Factorize an int.
      2. in BQP.
      3. almost exponentially faster than the best known classical algorithm.
      4. Largest examples I can find:
        1. 56153 = 233 × 241.
        2. https://medium.com/@aditya.yadav/rsa-2048-cracked-using-shors-algorithm-on-a-quantum-computer-660cb2297a95
  4. https://quantumexperience.ng.bluemix.net/qx/tutorial ctd.
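
The Hadamard gate from item 11 above is the matrix

\(H = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\)

so \(H|0\rangle = \frac{1}{\sqrt{2}}(|0\rangle + |1\rangle)\) and \(H|1\rangle = \frac{1}{\sqrt{2}}(|0\rangle - |1\rangle)\), an equal-weight superposition either way. Since \(H^2 = I\), applying it again undoes the superposition.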

PAR Class 19, Mon 2019-03-25

2   Quantum computing

  1. The book that I like most (today) is this:

    1. Quantum Computing for Computer Scientists 1st Edition ( recommended in Microsoft talk)

      Example of superposition: 2 slit experiment, page 93. Weights of either slit: 1/sqrt(2). Spread out weights for each slit (all /sqrt(6)): -1+i, -1-i, 1-i.

  2. My summary so far.

    1. Qbit.
    2. Its state is 0 or 1.
    3. Several qbits.
    4. The combined state is the tensor product of the individual qbits.
    5. For $n$ qbits, the tensor product is a vector with $2^n$ elements, one element for each possible combination of values of the qbits.
    6. Each element of the tensor product has a complex weight.
    7. The physical meaning is that the system is simultaneously in a superposition of all the possible combos of values of the component qbits.
    8. It is wrong to think that the system is really in one of the states, but you don't know which one. This is the hidden variable theory. It has been proved experimentally to be false.
    9. The probability of it being in a particular single state is proportional to the squared magnitude of its complex weight.
    10. For some sets of weights, the combined state cannot be separated into a tensor product of individual qbits. In this case, the individual qbits are entangled.
    11. You transform a state by multiplying it by a matrix.
    12. The matrix is invertible.
    13. The transformation doesn't destroy information.
    14. When you measure a state, it collapses into one of the component states. (This may be inaccurate.)
    15. The probability of collapsing into a particular state is the squared magnitude of its complex weight.
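
A small worked example of items 4-10: for two qbits with states \(a_0|0\rangle + a_1|1\rangle\) and \(b_0|0\rangle + b_1|1\rangle\), the tensor product is

\(a_0 b_0 |00\rangle + a_0 b_1 |01\rangle + a_1 b_0 |10\rangle + a_1 b_1 |11\rangle\),

with \(2^2=4\) complex weights. A state like \(\frac{1}{\sqrt{2}}(|00\rangle + |11\rangle)\) cannot be written in that product form (it would need \(a_0 b_1 = a_1 b_0 = 0\) while \(a_0 b_0\) and \(a_1 b_1\) are nonzero), so its two qbits are entangled. Measuring it collapses it to 00 or 11, each with probability \(|1/\sqrt{2}|^2 = 1/2\).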

PAR Class 18, Thu 2019-03-21

PAR Class 15, Mon 2019-03-11

2   OpenMP: CPU time vs wall clock (elapsed) time

On parallel.ecse, /parallel-class/openmp/rpi/cpuvs.wall.cc is a simple program that shows how wallclock time shrinks as the number of threads rises (up to a point) but CPU time rises from the start.

That is because one way for a thread to wait is to spin, burning cycles. For small delays, this is easier and has less overhead than explicitly suspending. The GNU version of OpenMP lets you choose the crossover point between burning cycles and suspending a thread.
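
Here's a minimal sketch of measuring both quantities; this is not the actual cpuvs.wall.cc, just the pattern (compile with g++ -fopenmp):

  #include <omp.h>
  #include <ctime>
  #include <cstdio>

  int main() {
    double wall0 = omp_get_wtime();      // wall clock (elapsed) time
    clock_t cpu0 = clock();              // CPU time, summed over all threads
    double s = 0;
    #pragma omp parallel for reduction(+:s)
    for (long i = 0; i < 100000000; i++) s += i * 1e-9;
    printf("wall %.3f s, cpu %.3f s, sum %g\n",
           omp_get_wtime() - wall0, (double)(clock() - cpu0) / CLOCKS_PER_SEC, s);
    return 0;
  }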

3   Thrust

3.1   Bug

I found and reported a bug in version 100904. This version does not work with OpenMP. The report was immediately closed because they already knew about it. They just hadn't told us users.

A while ago I reported a bug in nvcc, where it went into an infinite loop for a certain array size. The next minor version of CUDA was released the next day.

Two observations:

  1. I'm good at breaking SW by expecting it to meet its specs.
  2. Nvidia is responsive, which I like.

3.2   More info

  1. /parallel-class/thrust/doc/An_Introduction_To_Thrust.pdf
  2. GTC_2010_Part_2_Thrust_By_Example.pdf

3.3   Examples ctd

The point of these examples is to show you some techniques for parallel programming that are not obvious. I.e., you probably wouldn't think of them if you did not already know them.

  1. repeated_range2.cu is my improvement on repeated_range.cu:

    auto output=make_permutation_iterator(data.begin(), make_transform_iterator(make_counting_iterator(0), _1/3));

    1. make_transform_iterator(make_counting_iterator(0), _1/3)) returns pointers to the sequence 0,0,0,1,1,1,2,2,2, ...
  2. Unmodified thrust examples:

    1. expand.cu takes a vector like V= [0, 10, 20, 30, 40] and a vector of repetition counts, like C= [2, 1, 0, 3, 1]. Expand repeats each element of V the appropriate number of times, giving [0, 0, 10, 30, 30, 30, 40]. The process is as follows.

      1. Since the output vector will be longer than the input, the main program computes the output size, by reduce summing C, and constructs a vector to hold the output.
      2. Exclusive_scan C to obtain output offsets for each input element: C2 = [0, 2, 3, 3, 6].
      3. Scatter_if the nonzero counts into their corresponding output positions. A counting iterator, [0, 1, 2, 3, 4] is mapped with C2, using C as the stencil, giving C3 = [0, 0, 1, 3, 0, 0, 4].
      4. An inclusive_scan with max fills in the holes in C3, to give C4 = [0, 0, 1, 3, 3, 3, 4].
      5. Gather uses C4 to gather elements of V: [0, 0, 10, 30, 30, 30, 40].
    2. set_operations.cu. This shows methods of handling an operation whose output is of unpredictable size. The question is, is space or time more important?

      1. If the maximum possible output size is reasonable, then construct an output vector of that size, use it, and then erase it down to its actual size.

      2. Or, run the operation twice. The 1st time, write to a discard_iterator, and remember only the size of the written data. Then, construct an output vector of exactly the right size, and run the operation again.

        I use this technique a lot with ragged arrays in sequential programs.

    3. sparse_vector.cu represents and sums sparse vectors.

      1. A sparse vector has mostly 0s.

      2. The representation is a vector of element indices and another vector of values.

      3. Adding two sparse vectors goes as follows.

        1. Allocate temporary index and element vectors of the max possible size (the sum of the sizes of the two inputs).

        2. Catenate the input vectors.

        3. Sort by index.

        4. Find the number of unique indices by applying inner_product with addition and not-equal-to-next-element to the indices, then adding one.

          E.g., applied to these indices: [0, 3, 3, 4, 5, 5, 5, 8], it gives 5.

        5. Allocate exactly enough space for the output.

        6. Apply reduce_by_key to the indices and elements to add elements with the same keys.

          The size of the output is the number of unique keys.

  3. What's the best way to sort 16000 sets of 1000 numbers each? E.g., sort the rows of a 16000x1000 array? On geoxeon, /pc/thrust/rpi/comparesorts.cu, which I copied from http://stackoverflow.com/questions/28150098/how-to-use-thrust-to-sort-the-rows-of-a-matrix on stackoverflow, compares three methods.

    1. Call the thrust sort 16000 times, once per set. That took 10 secs.

    2. Sort the whole list of 16,000,000 numbers together. Then sort it again by key, with the keys being the set number, to bring the elements of each set together. Since the sort is stable, this maintains the order within each set. (This is also how radix sort works.) That took 0.04 secs.

    3. Call a thrust function (to sort a set) within another thrust function (that applies to each set). This is new in Thrust 1.8. That took 0.3 secs.

The second method, one global sort followed by a stable sort on the set number, is a surprising and useful paradigm. It works because

  1. there's an overhead to starting each thrust function, and
  2. radix sort, which thrust uses for ints, takes linear time.
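
Here's a sketch of the second method in Thrust; the fill is arbitrary and this is not the actual comparesorts.cu:

  #include <thrust/device_vector.h>
  #include <thrust/sort.h>
  #include <thrust/transform.h>
  #include <thrust/functional.h>
  #include <thrust/iterator/counting_iterator.h>
  using namespace thrust;
  using namespace thrust::placeholders;

  int main() {
    const int rows = 16000, cols = 1000, n = rows * cols;
    device_vector<int> data(n), key(n);
    // arbitrary fill; key[i] = the set (row) that element i belongs to
    transform(counting_iterator<int>(0), counting_iterator<int>(n), data.begin(), (_1 * 37) % 1000);
    transform(counting_iterator<int>(0), counting_iterator<int>(n), key.begin(), _1 / cols);
    sort_by_key(data.begin(), data.end(), key.begin());        // one global sort; set numbers carried along
    stable_sort_by_key(key.begin(), key.end(), data.begin());  // regroup the sets; stability keeps each set sorted
    return 0;
  }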

3.4   Backends

  1. The Thrust device can be CUDA, OpenMP, TBB, etc.

  2. You can spec it in 2 ways:

    1. by adding an extra arg at the start of a function's arg list.

    2. with an envar

      https://github.com/thrust/thrust/wiki/Host-Backends

      https://github.com/thrust/thrust/wiki/Device-Backends

PAR Class 14, Thu 2019-02-28

2   Several forms of C++ functions

  1. Traditional top level function

    auto add(int a, int b) { return a+b;}

    You can pass this to a function, really pass a pointer to the function. It doesn't optimize across the call.

  2. Overload operator() in a new class

    Each object of the class is a different function. The function can use the object's member values. This is a closure.

    This is local to the containing block.

    This form optimizes well.

  3. Lambda, or anon function.

    auto add = [](int a, int b) { return a+b;};

    This is local to the containing block.

    This form optimizes well.

  4. Placeholder notation.

    As an argument in, e.g., transform, you can do this:

    transform(..., _1+_2);

    This is nice and short.

    As this is implemented by overloading the operators, the syntax of the expression is limited to what was overloaded.
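
Since forms 1, 3, and 4 already have examples above, here's a sketch of form 2, a class overloading operator() (names are mine):

  #include <iostream>

  struct AddK {
    int k;                                        // per-object state; each AddK object is a different function
    explicit AddK(int k_) : k(k_) {}
    int operator()(int a) const { return a + k; } // the closure uses the captured value
  };

  int main() {
    AddK add3(3), add10(10);
    std::cout << add3(5) << ' ' << add10(5) << '\n';   // prints 8 15
    return 0;
  }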

3   Thrust

  1. Continue Stanford's parallel course notes.
    1. Lecture 8-: Thrust ctd.

3.1   Examples

  1. I rewrote /parallel-class/thrust/examples-1.8/tiled_range.cu into /parallel-class/thrust/rpi/tiled_range2.cu .

    It is now much shorter and much clearer. All the work is done here:

    gather(make_transform_iterator(make_counting_iterator(0), _1%N), make_transform_iterator(make_counting_iterator(N*C), _1%N), data.begin(), V.begin());

    1. make_counting_iterator(0) returns pointers to the sequence 0, 1, 2, ...
    2. _1%N is a function computing modulo N.
    3. make_transform_iterator(make_counting_iterator(0), _1%N) returns pointers to the sequence 0%N, 1%N, ...
    4. gather populates V. The i-th element of V gets the element of data selected by the i-th value of the transform iterator, i.e., data[i%N].
  2. tiled_range3.cu is even shorter. Instead of writing an output vector, it constructs an iterator for a virtual output vector:

    auto output=make_permutation_iterator(data, make_transform_iterator(make_counting_iterator(0), _1%N));

    1. *(output+i) is *(data+(i%N)).
    2. You can get as many tiles as you want by iterating.
    3. tiled_range3.cu also constructs an iterator for a virtual input vector (in this case a vector of squares) instead of storing the data:

    auto data = make_transform_iterator(make_counting_iterator(0), _1*_1);

  3. tiled_range5.cu shows how to use a lambda instead of the _1 notation:

    auto output=make_permutation_iterator(data, make_transform_iterator(make_counting_iterator(0), [](const int i){return i%N;} ));

    1. You have to compile with --std=c++11.

    2. This can be rewritten thus:

      auto f = [](const int i){return i%N;}; auto output = make_permutation_iterator(data, make_transform_iterator(make_counting_iterator(0), f));

    3. The shortest lambda is this:

      auto f = [](){};

  4. repeated_range2.cu is my improvement on repeated_range.cu:

    auto output=make_permutation_iterator(data.begin(), make_transform_iterator(make_counting_iterator(0), _1/3));

    1. make_transform_iterator(make_counting_iterator(0), _1/3)) returns pointers to the sequence 0,0,0,1,1,1,2,2,2, ...

PAR Class 13, Mon 2019-02-25

1   Thrust

  1. The github repository, with demo is https://github.com/thrust/thrust.git .

    Nvidia's proprietary version is slightly newer.

  2. The most comprehensive doc is online at http://thrust.github.io/doc/index.html

    It is badly written and slightly obsolete.

  3. There are various tutorials online, most obsolescent. E.g., they don't use C++11 lambdas, which are a big help.

  4. Look at some Thrust programs in /parallel-class/cuda/thrust

  5. Continue Stanford's parallel course notes.

    1. Lecture 8-: Thrust ctd.
  6. One Nvidia-sponsored alternative is agency, at https://github.com/agency-library/ .

  7. There are other alternatives that I'll mention later.

  8. The alternatives are lower-level (= faster and harder to use) and newer (= possibly less debugged, fewer users).

PAR Class 11, Tues 2019-02-19

1   Student talks in class, round 2

  1. Choose a topic (or group of topics) from the 589 on-demand sessions at the GPU Tech Conference.

  2. Summarize it in class. OK to show the whole presentation (or parts of it) with your commentary.

  3. Same timing and grouping rules as round 1.

  4. Presentation dates: March 11, 14, 18.

  5. No need to sign up for your topic since there'll probably be few overlaps. However here's a signup site for your presentation date.

    https://doodle.com/poll/27nkaqkies74zz7e

    I've tentatively allowed up to 6 talks per day.

    Please enter your name and rcsid.

2   Term project

  1. See the syllabus.
  2. Proposal due March 14.
  3. Progress reports on March 28, April 15.
  4. Presentations in class April 18, 22, 25.
  5. Paper etc due on April 25.

3   NVidia and CUDA ctd

  1. Continue Stanford's parallel course notes.
  2. Continue my notes on Nvidia and CUDA.

PAR Class 9, Mon 2019-02-11

1   OpenMP matmul homework

Would anyone like to present a solution?

2   NVidia and CUDA ctd

  1. Continue Stanford's parallel course notes.
  2. Continue my notes on Nvidia and CUDA.
  3. Sample programs.
    1. /local/cuda/samples has many programs by Nvidia. Some interesting ones:
      1. /local/cuda/samples/1_Utilities/deviceQuery/deviceQuery describes the host's GPUs.
      2. /local/cuda/samples/0_Simple/ does some benchmarks.
    2. Stanford's examples are in /parallel-class/stanford/tutorials.

PAR Class 8, Thurs 2019-02-07

1   NVidia device architecture, CUDA ctd

  1. The above shared memory model hits a wall; CUDA handles the other side of the wall.
  2. See Stanford's parallel course notes, which they've made freely available. (Thanks.)
    1. They're part of Nvidia's Academic programs site.
    2. They are very well done.
    3. They are obsolete in some places, which I'll mention.
    4. However, a lot of newer CUDA material is also obsolete.
    5. Nevertheless, the principles are still mostly valid.

2   Nvidia conceptual hierarchy

As always, this is as I understand it, and could be wrong. Nvidia uses their own terminology inconsistently. They may use one name for two things (e.g., Tesla and GPU), and may use two names for one thing (e.g., module and accelerator). As time progresses, they change their terminology.

  1. At the bottom is the hardware micro-architecture. This is an API that defines things like the available operations. The last several Nvidia micro-architecture generations are, in order, Tesla (which introduced unified shaders), Fermi, Kepler, Maxwell (introduced in 2014), Pascal (2016), and Volta (2017).
  2. Each micro-architecture is implemented in several different microprocessors. E.g., the Kepler micro-architecture is embodied in the GK107, GK110, etc. Pascal is GP104 etc. The second letter describes the micro-architecture. Different microprocessors with the same micro-architecture may have different amounts of various resources, like the number of processors and clock rate.
  3. To be used, microprocessors are embedded in graphics cards, aka modules or accelerators, which are grouped into series such as GeForce, Quadro, etc. Confusingly, there is a Tesla computing module that may use any of the Tesla, Fermi, or Kepler micro-architectures. Two different modules using the same microprocessor may have different amounts of memory and other resources. These are the components that you buy and insert into a computer. A typical name is GeForce GTX1080.
  4. There are many slightly different accelerators with the same architecture, but different clock speeds and memory, e.g. 1080, 1070, 1060, ...
  5. The same accelerator may be manufactured by different vendors, as well as by Nvidia. These different versions may have slightly different parameters. Nvidia's reference version may be relatively low performance.
  6. The term GPU sometimes refers to the microprocessor and sometimes to the module.
  7. There are four families of modules: GeForce for gamers, Quadro for professionals, Tesla for computation, and Tegra for mobility.
  8. Nvidia uses the term Tesla in two unrelated ways. It is an obsolete architecture generation and a module family.
  9. Geoxeon has a (Maxwell) GeForce GTX Titan and a (Kepler) Tesla K20xm. Parallel has a (Pascal) GeForce GTX 1080. We also have an unused (Kepler) Quadro K5000.
  10. Since the highest-end (Tesla) modules don't have video out, they are also called something like compute modules.

3   GPU range of speeds

Here is an example of the wide range of Nvidia GPU speeds; all times are +-20%.

The GTX 1080 has 2560 CUDA cores @ 1.73GHz and 8GB of memory. matrixMulCUBLAS runs at 3136 GFlops. However the reported time (0.063 msec) is so small that it may be inaccurate. The quoted speed of the 1080 is about triple that. I'm impressed that the measured performance is so close.

The Quadro K2100M in my Lenovo W540 laptop has 576 CUDA cores @ 0.67 GHz and 2GB of memory. matrixMulCUBLAS runs at 320 GFlops. The time on the GPU was about .7 msec, and on the CPU 600 msec.

It's nice that the performance almost scaled with the number of cores and clock speed.

4   CUDA

4.1   Versions

  1. CUDA has a capability version, whose major number corresponds to the micro-architecture generation. Kepler is 3.x. The K20xm is 3.5. The GTX 1080 is 6.1. Here is a table of the properties of different compute capabilities. However, that table is not completely consistent with what deviceQuery shows, e.g., the shared memory size.

  2. nvcc, the CUDA compiler, can be told which capabilities (aka architectures) to compile for. They can be given as a real architecture, e.g., sm_61, or a virtual architecture, e.g., compute_61.

    Just use the option -arch=compute_61.

  3. The CUDA driver and runtime also have a software version, defining things like available C++ functions. The latest is 9.1. This is unrelated to the capability version.
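
For example, item 2's option might be used like this (the file name is hypothetical):

  nvcc -arch=compute_61 add.cu -o add      # virtual architecture
  nvcc -arch=sm_61      add.cu -o add      # real architecture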

4.2   Stanford lectures

  1. On the web server.
  2. On geoxeon: /parallel-class/stanford/lectures/
  3. Lecture 4: how to
    1. cache data into shared memory for speed, and
    2. use hierarchical sync.

4.3   Misc

  1. With CUDA, the dominant problem in program optimization is optimizing the data flow. Getting the data quickly to the cores is harder than processing it. It helps a lot to have regular arrays, where consecutive cores read or write consecutive entries.

    This is analogous to the hardware fact that wires are bigger (hence, more expensive) than gates.

  2. That is the opposite optimization to OpenMP, where having different threads writing to adjacent addresses will cause the false sharing problem.

  3. Nvidia CUDA FAQ

    1. has links to other Nvidia docs.
    2. is a little old. Kepler and Fermi are 2 and 3 generations old.
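
A tiny sketch of the regular access pattern that item 1 recommends: consecutive threads touch consecutive array entries, so their memory accesses coalesce (the kernel and names are mine):

  __global__ void scale(float *a, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread i handles element i
    if (i < n) a[i] *= s;                            // adjacent threads hit adjacent addresses
  }
  // launch: scale<<<(n + 255) / 256, 256>>>(d_a, 2.0f, n);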

PAR Class 7, Mon 2019-02-04

1   OpenMP lessons

What you should have learned from the OpenMP lectures:

  1. How to use a well-designed and widely used API that is useful on almost all current CPUs.
  2. Shared memory: the most common parallel architecture.
  3. How to structure a program to be parallelizable.
  4. Issues in parallel programs, like nondeterminism, race conditions, critical regions, etc.

2   Types of memory allocation

Here's a brief overview of my understanding of the various places that you can assign memory in a program.

  1. Static. Define a fixed-size global array. The variable is constructed at compile time, so accesses might perhaps be faster. Global vars with non-default initial values increase the executable file size. If they're large enough, you need to use the compiler option -mcmodel=medium or -mcmodel=large. I don't know the effect on the program's size or speed.
  2. Stack. Define local arrays, that are created and freed as the routine is entered and exited. Their addresses relative to the base of this call frame may be constant. The default stack size is 8MB. You can increase this with the command ulimit or in the program as shown in stacksize.cc. I believe that in OpenMP, the max stacksize may be allocated when each thread is created. Then, a really big stacksize might have a penalty.
  3. Heap. You use new and delete. Variables are constructed whenever you want. The more objects on the heap, the more time that each new or delete takes. If you have lots of objects consider using placement new or creating an array of them.

I like to use static, then stack, and heap only when necessary.
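
A sketch contrasting the three places, in the order above:

  #include <vector>

  const int N = 1000;
  static double a[N][N];                  // 1. static: size fixed at compile time

  void f() {
    double b[N];                          // 2. stack: created and freed on entry/exit (mind the 8MB default limit)
    std::vector<double> c(N * N);         // 3. heap: constructed whenever you want, but each allocation costs time
    b[0] = c[0] = 0;                      // touch them so the compiler keeps them
  }

  int main() { f(); return 0; }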

Google's allocator is noticeably better than the default one. To use it, link your programs with -ltcmalloc. You can often use it on an existing executable foo thus:

LD_PRELOAD="/usr/lib/libtcmalloc.so" foo

I found it to save 15% to 30% in time.

Another memory concern is speed. Parallel has a NUMA (Non-Uniform Memory Architecture). It has two 14-core Xeons. Each processor has 128GB of main memory attached. Although all 256GB are in a common address space, accessing memory attached to the same processor that the thread is running on is faster.

The following is what I think based on some research, but may be wrong: A 4KB page of memory is assigned to a specific processor's memory when it is first written (not when it is reserved). So, each page of a large array may be attached to a different processor. This can be used to optimize things. This gets more fun with 8-processor systems.

All that is separate from cache issues.

You can also assign your OpenMP threads to specific cores. This affects speed in ways I don't understand. The issues are resource sharing vs conflicts.

3   Optional Homework - bring your answers and discuss next week

  1. Write a program to multiply two 100x100 matrices. Do it the conventional way, not using anything fancy like Strassen's algorithm. Now, see how much improvement you can get with OpenMP. Measure only the elapsed time for the multiplication, not for the matrix initialization.

    Report these execution times.

    1. W/o openmp enabled (Don't use -fopenmp. Comment out the pragmas.)
    2. With openmp, using only 1 thread.
    3. Using 2, 4, 8, 16, 32, 64 threads.
  2. Write programs to test the effect of the reduction pragma:

    1. Create an array of 1,000,000,000 floats and fill it with pseudorandom numbers from 0 to 1.
    2. Do the following tests with 1, 2, 4, 8, 16, and 32 threads.

    Programs to write and test:

    1. Sum it with a simple for loop. This will give a wrong answer with more than 1 thread, but is fast.
    2. Sum it with the subtotal variable protected with an atomic pragma.
    3. Sum it with the subtotal variable protected with a critical pragma.
    4. Sum it with a reduction loop.
  3. Devise a test program to estimate the time to execute a task pragma. You might start with taskfib.cc.

  4. Sometimes parallelizing a program can increase its elapsed time. Try to create such an example, with 2 threads being slower than 1.
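
Not a full solution (the point of the homework is the timing comparison), but for reference the three protection mechanisms in item 2 look roughly like this in OpenMP:

  #include <omp.h>

  // a: the array of pseudorandom floats; n: its length
  double sum_atomic(const float *a, long n) {
    double total = 0;
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
      #pragma omp atomic
      total += a[i];                       // each update serialized, cheaply
    }
    return total;
  }

  double sum_critical(const float *a, long n) {
    double total = 0;
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
      #pragma omp critical
      total += a[i];                       // heavier serialization than atomic
    }
    return total;
  }

  double sum_reduction(const float *a, long n) {
    double total = 0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; i++)
      total += a[i];                       // private subtotals, combined at the end
    return total;
  }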

4   NVidia device architecture, start of CUDA

  1. The above shared memory model hits a wall; CUDA handles the other side of the wall.
  2. See Stanford's parallel course notes, which they've made freely available. (Thanks.)
    1. They are very well done.
    2. They are obsolete in some places, which I'll mention.
    3. However, a lot of newer CUDA material is also obsolete.
    4. Nevertheless, the principles are still mostly valid.

PAR Class 6, Thurs 2019-01-31

1   Student talks

A few more talks.

2   OpenMP

  1. Now that we've seen some examples, look at the LLNL tutorial. https://computing.llnl.gov/tutorials/openMP/

  2. We'll see some more examples running on parallel, at /parallel-class/openmp/rpi

  3. We saw that running even a short OpenMP program could be unpredictable. The reason for that was a big lesson.

    There are no guarantees about the scheduling of the various threads. There are no guarantees about fairness of resource allocation between the threads. Perhaps thread 0 finishes before thread 1 starts. Perhaps thread 1 finishes before thread 2 starts. Perhaps thread 0 executes 3 machine instructions, then thread 1 executes 1 instruction, then thread 0 executes 100 more, etc. Each time the process runs, something different may happen. One thing may happen almost all the time, to trick you.

    This happened during the first space shuttle launch in 1981. A 1-in-70 chance event prevented the computers from synchronizing. This had been observed once before during testing. However, when they couldn't get it to happen again, they ignored it.

    This interlacing of different threads can happen at the machine instruction level. The C statement K=K+3 can translate into the machine instructions

    LOAD K; ADD 3; STORE K.

    Let's label the two threads T0 and T1.

    If the threads interlace thus:

    T0: LOAD K; T0: ADD 3; T0: STORE K; T1: LOAD K; T1: ADD 3; T1: STORE K

    then K increases by 6. If they interlace thus:

    T0: LOAD K; T0: ADD 3; T1: LOAD K; T1: ADD 3; T0: STORE K; T1: STORE K

    then K increases by 3.

  4. This can be really hard to debug, particularly when you don't know where this is happening.

    One solution is to serialize the offending code, so that only one thread executes it at a time. The limit would be to serialize the whole program and not to use parallel processing.

  5. OpenMP has several mechanisms to help.

  6. A critical pragma serializes the following block. There are two considerations.

    1. You lose the benefits of parallelization on that block.
    2. There is an overhead to set up a critical block that I estimate might be 100,000 instructions.
  7. An atomic pragma serializes one short C statement, which must be one of a small set of allowed operations, such as increment. It matches certain atomic machine instructions, like test-and-set. The overhead is much smaller.

  8. If every thread is summing into a common total variable, the reduction pragma causes each thread to sum into a private subtotal, and then sum the subtotals. This is very fast.

  9. Another lesson is that sometimes you can check your program's correctness with an independent computation. For the trivial example of summing \(i^2\), use the formula

    \(\sum_{i=1}^N i^2 = N(N+1)(2N+1)/6\).

    There is a lot of useful algebra if you have the time to learn it. I, ahem, learned this formula in high school.

  10. Another lesson is that even when the program gives the same answer every time, it might still be consistently wrong.

  11. Another is that just including OpenMP facilities, like -fopenmp, into your program slows it down even if you don't use them.

  12. Another is that the only meaningful time metric is elapsed real time. One reason that CPU time is meaningless is that OpenMP sometimes pauses a thread's progress by making it do a CPU-bound loop. (That is also a common technique in HW.)

  13. Also note that CPU times can vary considerably with successive executions.

  14. Also, using too many threads might increase the real time.

  15. Finally, floating point numbers have their own problems. They are an approximation of the mathematical concept called the real number field. That is defined by 11 axioms that state obvious properties, like

    A+B=B+A (commutativity) and

    A+(B+C)=(A+B)+C (associativity).

    This is covered in courses on modern algebra or abstract algebra.

    The problem is that most of these are violated, at least somewhat, with floats. E.g.,

    \(\left(10^{20}-10^{20}\right)+1=0+1=1\) but

    \(10^{20}+\left(-10^{20}+1\right)=10^{20}-10^{20}=0\).

    Therefore when threads execute in a different order, floating results might be different.

    There is no perfect solution, though using double is a start.

    1. On modern CPUs, double is just as fast as float.

    2. However it's slower on GPUs.

      How much slower can vary. Nvidia's Maxwell line spent very little real estate on double precision, so it was very slow. In contrast, Maxwell added a half-precision float, for implementing neural nets. Apparently neural nets use very few significant digits.

      Nvidia's newer Pascal line reverted this design choice and spends more real estate on implementing double. parallel.ecse's GPU is Pascal.

      Nvidia's Volta generation also does doubles well.

    3. It also requires moving twice the data, and data movement is often more important than CPU time.

    The large field of numerical analysis is devoted to finding solutions, with more robust and stable algorithms.

    Summing an array by first sorting, and then summing the absolutely smaller elements first is one technique. Inverting a matrix by pivoting on the absolutely largest element, instead of on \(a_{11}\) is another.

  16. Another nice, albeit obsolescent and incomplete, ref: Wikibooks OpenMP.

  17. Its tasks examples are interesting.

  18. OpenMP tasks

    1. While inside a pragma parallel, you queue up lots of tasks - they're executed as threads are available.

    2. My example is tasks.cc with some variants.

    3. taskfib.cc is modified from an example in the OpenMP spec.

      1. It shows how tasks can recursively spawn more tasks.

      2. This is only an example; you would never implement fibonacci that way. (The fastest way is the following closed form. In spite of the sqrts, the expression always gives an integer.)

        \(F_n = \frac{\left( \frac{1+\sqrt{5}}{2} \right) ^n - \left( \frac{1-\sqrt{5}}{2} \right) ^n }{\sqrt{5}}\)

      3. Spawning a task is expensive; we can calculate the cost.

  19. stacksize.cc - get and set stack size. Can also do ulimit -s.

  20. omps.cc - read assorted OMP variables.

  21. Since I/O is not thread-safe, in hello_single_barrier.cc it has to be protected by a critical and also followed by a barrier.

  22. OpenMP sections - my example is sections.cc with some variants.

    1. Note my cool way to print an expression's name followed by its value.
    2. Note the 3 required levels of pragmas: parallel, sections, section.
    3. The assignment of sections to threads is nondeterministic.
    4. IMNSHO, OpenMP is considerably easier than pthreads, fork, etc.
  23. http://stackoverflow.com/questions/13788638/difference-between-section-and-task-openmp

PAR Class 5, Mon 2019-01-28

1   Remote mounting and accessing files

This works for me. You must have ssh set up.

  1. Run nemo,
    1. Click files > connect to server
    2. Type host but not user or password.
    3. Remote filesystem is mounted locally under /run/user/USERNUMBER/gvfs/sftp:host=HOST/
  2. In emacs, use tramp mode: Access file as /HOST:/FILE

2   Student talks

5 more talks.

PAR Class 4, Thurs 2019-01-24

1   Student talks

  1. in the order that I listed the topics:
    1. Describe the fastest 2 computers on the top500 list. https://www.top500.org/lists/2018/11/
    2. Describe computers 3 and 4 on top500.
    3. Describe computers 5 and 6 on top500.
    4. Describe RPI's Blue Gene.
    5. Summarize python's parallel capabilities.
    6. Summarize Matlab's parallel capabilities.
    7. Summarize Mathematica's parallel capabilities.
    8. Summarize MPI.
    9. Describe an application where physics needs/uses parallel computing.
    10. Describe an application where astronomy needs/uses parallel computing.
    11. Describe an application where biology needs/uses parallel computing.
    12. Describe an application where astronomy needs/uses parallel computing.
    13. Give an application where machine learning needs/uses parallel computing.
    14. Give an application where computational fluid dynamics needs/uses parallel computing.
    15. Summarize OpenACC.
    16. Summarize pthreads.
  2. If you picked 2 topics, ok to give only one talk.

2   ssh, afs, zfs

  1. I recommend that you create key pairs to connect to parallel w/o typing your password each time.

  2. To avoid retyping your ssh private key passphrase in the same shell, do this

    ssh-add

  3. One advantage of ssh is that you can mount your parallel directory on your local machine. On my laptop, I run nemo, then do File - Connect to server, choose type ssh, server parallel.ecse.rpi.edu, use my RCSID as the user name, leave the password blank, and give my private key passphrase if asked. Any other program to mount network shares would work as well.

    Where the share is actually mounted varies. On my laptop, it's here: /var/run/user/1000/gvfs/sftp:host=parallel.ecse.rpi.edu,user=wrf

    As with any network mount, fancy filesystem things like attributes, simultaneous reading and writing from different machines, etc., probably won't work.

  4. With ssh, you can also copy files back and forth w/o mounting or typing passwords:

    1. scp localfile parallel.ecse.rpi.edu:
    2. scp -r localdir parallel.ecse.rpi.edu:
    3. scp parallel.ecse.rpi.edu:remotefile .

    It even does filename completion on remote files.

  5. You can also run single commands:

    ssh parallel.ecse.rpi.edu hostname

  6. Parallel sometimes runs an AFS client, so your RCS files are accessible (read and write) at

    /afs/rpi.edu/home/??/RCSID

    Search for your home dir thus:

    ls -ld /afs/rpi.edu/home/*/RCSID*

    Authenticate yourself with klog.

PAR Class 3, Thurs 2019-01-17

1   parallel.ecse

Its accounts are completely separate from RCS; I just had you use the same userid for convenience. Changing your password on parallel does not affect your RCS password, and vice versa.

3   Student talks - 1st set

  1. Go to https://doodle.com/poll/viz67bb7xrw45yv3. Pick a topic by entering your name and rcsid (because sometimes students have similar names and sometimes students use different names from what's on the classlist). The topics are:
    1. Describe the fastest 2 computers on the top500 list. https://www.top500.org/lists/2018/11/
    2. Describe computers 3 and 4 on top500.
    3. Describe computers 5 and 6 on top500.
    4. Describe RPI's Blue Gene.
    5. Summarize python's parallel capabilities.
    6. Summarize Matlab's parallel capabilities.
    7. Summarize Mathematica's parallel capabilities.
    8. Summarize MPI.
    9. Describe an application where physics needs/uses parallel computing.
    10. Describe an application where astronomy needs/uses parallel computing.
    11. Describe an application where biology needs/uses parallel computing.
    12. Describe an application where astronomy needs/uses parallel computing.
    13. Give an application where machine learning needs/uses parallel computing.
    14. Give an application where computational fluid dynamics needs/uses parallel computing.
    15. Summarize OpenACC.
    16. Summarize pthreads.
    17. Other related topic of your choice. Tell me and I'll add it to the list.
  2. Note that any topic can be picked at most once.
  3. Two students may work together to present one talk.
  4. Prepare and give a 5-10 minute talk next Thurs 1/24. A team would jointly give one 5-10 minute talk.

4   OpenMP

  1. We'll see some examples running on parallel, at /parallel-class/openmp/rpi

  2. We saw that running even a short OpenMP program could be unpredictable. The reason for that was a big lesson.

    There are no guarantees about the scheduling of the various threads. There are no guarantees about fairness of resource allocation between the threads. Perhaps thread 0 finishes before thread 1 starts. Perhaps thread 1 finishes before thread 2 starts. Perhaps thread 0 executes 3 machine instructions, then thread 1 executes 1 instruction, then thread 0 executes 100 more, etc. Each time the process runs, something different may happen. One outcome may happen almost all the time, which can trick you. The tiny demo below makes the point.
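
    A tiny demo (my sketch): compile with -fopenmp and run it several times; the order of the lines, and which thread prints first, will usually change.

      // Each thread prints one line; the interleaving is up to the scheduler.
      #include <cstdio>
      #include <omp.h>

      int main() {
          #pragma omp parallel
          printf("hello from thread %d of %d\n",
                 omp_get_thread_num(), omp_get_num_threads());
          return 0;
      }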

5   Compiling and running

  1. C++ compilers: g++-8, clang-7, etc. I'm updating them.
  2. Measuring program times when many people are on the machine:
    1. This is a problem.
    2. One partial solution is to use batch to queue your job to run when the system is unloaded (example below).
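
    For example, batch (from the standard at package) reads commands from stdin and runs them when the load drops; myprog here is just a placeholder:

      echo 'cd ~/project && ./myprog > timing.txt 2>&1' | batch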

PAR Class 1, Thurs 2019-01-10

1   Material

  1. Read the syllabus.
  2. Read https://computing.llnl.gov/tutorials/parallel_comp/ for an intro to parallel computing.
  3. Some points:
    1. Parallel computing is decades old; there were commercial machines in the 1980s. However, clock speeds were then still increasing, so serial machines were more interesting.
    2. Now: physics limits to processor speed.
    3. History of Nvidia.
    4. Intel CPUs vs Nvidia CUDA cores.
    5. Advantages and disadvantages of shared memory.
    6. OpenMP vs CUDA.
    7. Rate-limiting cost usually I/O not computation.
  4. Think about these questions for next week.
    1. Answer the following questions.
      1. Why have machine cycle speeds stopped increasing?
      2. What architectures do the top 3 machines on the Top 500 list use?
      3. Which one of the following 4 choices describes most GPUs: SISD, SIMD, MISD, or MIMD?
      4. Which one of the following 4 choices describes most current multicore CPUs: SISD, SIMD, MISD, or MIMD?
      5. Per Amdahl's law (see the formula after this list), if a program is 10% sequential and 90% parallelizable, what is the max speedup that can be obtained with an infinite number of parallel processors?
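
      For reference, Amdahl's law for a program with parallelizable fraction \(p\) running on \(N\) processors gives the speedup

        \(S(N) = \frac{1}{(1-p) + p/N}\)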

2   Computer

  1. parallel.ecse accounts
    1. I'll try to set yours up in class today.
    2. Play with them. I recommend connecting with ssh.
  2. Intel Xeon Phi coprocessor
    1. Parallel.ecse has one.
    2. This tech was dropped by Intel, so I don't intend to cover it in class.
    3. If anyone is interested, tell me today. Otherwise it may stop working when I upgrade the system from Ubuntu 16.04 to 18.xx.

3   My research

I do parallel geometry algorithms on large problems for CAD and GIS. See my home page. If this is interesting, talk to me.

PAR Syllabus

1   Catalog info

Title: ECSE-4740-01 Applied Parallel Computing for Engineers, CRN 74971
Semesters: Spring term annually
Credits: 3 credit hours
Time and place: Mon and Thurs noon-1:20pm, JEC6314 (note the change).

2   Description

  1. This is intended to be a computer engineering course to provide students with knowledge and hands-on experience in developing applications software for affordable parallel processors. This course will cover hardware that any lab can afford to purchase. It will cover the software that, in the prof's opinion, is the most useful. There will also be some theory.
  2. The target audiences are ECSE seniors and grads and others with comparable background who wish to develop parallel software.
  3. This course will have minimal overlap with parallel courses in Computer Science. We will not teach the IBM BlueGene, because it is so expensive, nor cloud computing and MPI, because most big data problems are in fact small enough to fit on our hardware.
  4. You may usefully take all the parallel courses at RPI.
  5. The unique features of this course are as follows:
    1. Use of only affordable hardware that any lab might purchase, such as Nvidia GPUs. This is currently the most widely used and least expensive parallel platform.
    2. Emphasis on learning several programming packages, at the expense of theory. However you will learn a lot about parallel architecture.
  6. Hardware taught, with reasons:
    Multicore Intel Xeon:
      universally available and inexpensive, comparatively easy to program, powerful
    Nvidia GPU accelerator:
      widely available (Nvidia external graphics processors are on 1/3 of all PCs), very inexpensive, powerful, but harder to program. Good cards cost only a few hundred dollars.
    IBM quantum computer:
      (perhaps)
  7. Software that might be taught, with reasons:
    OpenMP C++ extension:
      widely used, easy to use if your algorithm is parallelizable, backend is multicore Xeon.
    Thrust C++ functional programming library:
      FP is nice, hides low level details, backend can be any major parallel platform.
    MATLAB: easy to use parallelism for operations that Mathworks has implemented in parallel, etc.
    CUDA C++ extension and library for Nvidia:
      low level access to Nvidia GPUs.
  8. The techniques learned here will also be applicable to larger parallel machines -- numbers 1 and 2 on the top 500 list use NVIDIA GPUs. (Number 10 is a BlueGene.)
  9. Effectively programming these processors will require in-depth knowledge about parallel programming principles, as well as the parallelism models, communication models, and resource limitations of these processors.

3   Prerequisite

ECSE-2660 CANOS or equivalent, knowledge of C++.

4   Instructors

4.1   Professor

W. Randolph Franklin. BSc (Toronto), AM, PhD (Harvard)

Office:

Jonsson Engineering Center (JEC) 6026

Phone:

+1 (518) 276-6077 (forwards)

Email:

frankwr@YOUKNOWTHEDOMAIN

Email is my preferred communication medium.

Non-RPI accounts are fine, but please show your name, at least in the comment field. A subject prefix of #Prob is helpful. GPG encryption is fine.

Web:

https://wrf.ecse.rpi.edu/

A quick way to get there is to google RPIWRF.

Office hours:

After each lecture, usually as long as anyone wants to talk. Also by appointment.

Informal meetings:
 

If you would like to lunch with me, either individually or in a group, just mention it. We can then talk about most anything legal and ethical.

5   Course websites

The homepage has lecture summaries, syllabus, homeworks, etc.

6   Reading material

6.1   Text

There is no required text, but the following inexpensive books may be used. I might mention others later.

  1. Sanders and Kandrot, CUDA by example. It gets excellent reviews, although it is several years old. Amazon has many options, including Kindle and renting hardcopies.
  2. Kirk and Hwu, 2nd edition, Programming massively parallel processors. It concentrates on CUDA.

One problem is that even recent books may be obsolete. For instance they may ignore the recent CUDA unified memory model, which simplifies CUDA programming at a performance cost. Even if the current edition of a book was published after unified memory was released, the author might not have updated the examples.

6.2   Web

There is a lot of free material on the web, which I'll reference class by class. Because web pages vanish so often (really!), I may cache some locally. If interested, you might start here:

https://hpc.llnl.gov/training/tutorials

7   Computer systems used

This course will primarily use (remotely via ssh) parallel.ecse.rpi.edu.

Parallel has:

  1. a dual 14-core Intel Xeon E5-2660 2.0GHz
  2. 256GB of DDR4-2133 ECC Reg memory
  3. Nvidia GPU, perhaps GeForce GTX 1080 processor with 8GB
  4. Intel Xeon Phi 7120A
  5. Samsung Pro 850 1TB SSD
  6. WD Red 6TB 6 Gb/s hard drive
  7. CUDA
  8. OpenMP 4.0
  9. Thrust
  10. Ubuntu 16.04

Material for the class is stored in /parallel-class/ .

We may also use geoxeon.ecse.rpi.edu. Geoxeon has:

  1. Dual 8-core Intel Xeon E5-2687W, 3.1 GHz, 8.0 GT/s, 20 MB cache, 150 W.
  2. 128GB DRAM.
  3. Nvidia GPUs: GM200 GeForce GTX Titan X and GK100GL Tesla K20Xm.
  4. Ubuntu 18.04.1
  5. CUDA, Thrust, OpenMP, etc.

We may also use a parallel virtual machine on Amazon EC2. If so, you will be expected to establish an account. I expect the usage to be in the free category.

8   Assessment measures, i.e., grades

  1. There will be no exams.

  2. The grade will be based on a term project and class presentations.

  3. Optional ungraded homeworks will be assigned, with the solutions discussed in a later class.

  4. Deliverables for the term project:

    1. A 2-minute project proposal given to the class around the middle of the semester.
    2. A 10-minute project presentation given to the class in the last week.
    3. Some progress reports.
    4. A write-up uploaded on the last class day. This will contain an academic paper, code and perhaps video or user manual.

8.1   Term project

  1. For the latter part of the course, most of your homework time will be spent on a term project.
  2. You are encouraged to do it in teams of up to 3 people. A team of 3 people would be expected to do twice as much work as 1 person.
  3. You may combine this with work for another course, provided that both courses know about this and agree. I always agree.
  4. If you are a grad student, you may combine this with your research, if your prof agrees, and you tell me.
  5. You may build on existing work, either your own or others'. You have to say what's new, and have the right to use the other work. E.g., using any GPLed code or any code on my website is automatically allowable (because of my Creative Commons licence).
  6. You will implement, demonstrate, and document something vaguely related to parallel computing.
  7. You will give a 15 minute talk and demo in class.

8.1.1   Size of term project

It's impossible to specify how many lines of code makes a good term project. E.g., I take pride in writing code that can be simultaneously shorter, more robust, and faster than others'. See my 8-line program for testing whether a point is in a polygon: Pnpoly.

According to Big Blues, when Bill Gates was collaborating with IBM around 1980, he once rewrote a code fragment to be shorter. However, according to the IBM metric, number of lines of code produced, he had just caused that unit to officially do negative work.

8.1.2   Deliverables

  1. An implementation showing parallel computing.
  2. An extended abstract or full paper on your project, written up like an academic paper. You should follow the style guide for some major conference (I don't care which, but can point you to one).
  3. A more detailed manual, showing how to use it.
  4. A talk in class.

8.2   Early warning system (EWS)

As required by the Provost, we may post notes about you to EWS, for example, if you're having trouble doing homeworks on time, or miss an exam. E.g., if you tell me that you had to miss a class because of family problems, then I may forward that information to the Dean of Students office.

9   Academic integrity

See the Student Handbook for the general policy. The summary is that students and faculty have to trust each other. After you graduate, your most important possession will be your reputation.

Specifics for this course are as follows.

  1. You may collaborate on homeworks, but each team of 1 or 2 people must write up the solution separately (one writeup per team) using their own words. We willingly give hints to anyone who asks.
  2. The penalty for two teams handing in identical work is a zero for both.
  3. You may collaborate in teams of up to 3 people for the term project.
  4. You may get help from anyone for the term project. You may build on a previous project, either your own or someone else's. However you must describe and acknowledge any other work you use, and have the other person's permission, which may be implicit. E.g., my web site gives a blanket permission to use it for nonprofit research or teaching. You must add something creative to the previous work. You must write up the project on your own.
  5. However, writing assistance from the Writing Center and similar sources is allowed, if you acknowledge it.
  6. The penalty for plagiarism is a zero grade.
  7. Cheating will be reported to the Dean of Students Office.

10   Student feedback

Since it's my desire to give you the best possible course in a topic I enjoy teaching, I welcome feedback during (and after) the semester. You may tell me or write me, or contact a third party, such as Prof James Lu, the ECSE undergrad head, or Prof John Wen, the ECSE Dept head.