PAR Class 26, Thu 2019-04-18

1   Final presentations

Everyone has now signed up for next week, Mon or Thurs.

https://doodle.com/poll/kcsek2wi39s97uzm

Nevertheless, I'll entertain tales of woe about the date.

2   Inspiration for finishing your term projects

  1. The Underhanded C Contest

    "The goal of the contest is to write code that is as readable, clear, innocent and straightforward as possible, and yet it must fail to perform at its apparent function. To be more specific, it should do something subtly evil. Every year, we will propose a challenge to coders to solve a simple data processing problem, but with covert malicious behavior. Examples include miscounting votes, shaving money from financial transactions, or leaking information to an eavesdropper. The main goal, however, is to write source code that easily passes visual inspection by other programmers."

  2. The International Obfuscated C Code Contest

  3. https://www.awesomestories.com/asset/view/Space-Race-American-Rocket-Failures

    Moral: After early disasters, sometimes you can eventually get things to work.

  4. The 'Wrong' Brothers Aviation's Failures (1920s)

  5. Early U.S. rocket and space launch failures and explosions

  6. Numerous US Launch Failures

3   Intel Xeon Phi 7120A

3.1   Summary

  1. Intel's answer to Nvidia
  2. The start of the Wikipedia article might be interesting.
  3. Parallel.ecse has a Phi.
  4. However, because of lack of demand, I haven't installed the current drivers, so it's not usable.
  5. In July 2018, Intel ended this product line:
    1. https://www.nextplatform.com/2018/07/27/end-of-the-line-for-xeon-phi-its-all-xeon-from-here/
    2. https://www.networkworld.com/article/3296004/intel-ends-the-xeon-phi-product-line.html
  6. Nevertheless, anything Intel does is worth a look.
  7. Historically, Intel has introduced a number of products that failed. From time to time, one of their products also succeeds. Perhaps they should not start to work on projects that are going to fail. :-)
  8. Intel also incorporated many of the Phi's features into the regular Xeon.

3.2   In general

  1. The Xeon Phi is Intel's brand name for their MIC (Many Integrated Core) architecture.

  2. The 7120A uses Intel's Knights Corner (1st generation) MIC architecture; it was launched in 2014.

  3. It has 61 cores running 244 threads (4 per core), clocked at about 1.24 GHz.

    Having several threads per core helps hide the latency of fetching data from memory.

  4. It has 16 GB of memory accessible at 352 GB/s and 30 MB of L2 cache, and it peaks at about 1 TFlops double precision.

  5. It is a coprocessor on a card accessible from a host CPU on a local network.

  6. It is intended as a supercomputing competitor to Nvidia.

  7. The mic architecture is quite similar to the Xeon.

  8. However, executables for one don't run on the other, unless the source was compiled to include both versions in the executable file.

  9. The mic has been tuned to emphasize floating-point performance at the expense of, e.g., speculative execution.

    This helps to make it competitive with Nvidia, even though Nvidia GPUs can have many more cores.

  10. Its OS is an embedded Linux that uses BusyBox for its standard utilities.

  11. The SW is called MPSS (Manycore Platform Software Stack).

  12. The mic can be integrated with the host in various ways that I haven't (yet) implemented; there's a sketch of the offload style at the end of this list.

    1. Processes on the host can execute subprocesses on the device, as happens with Nvidia CUDA.
    2. E.g., OpenMP on the host can run parallel threads on the mic.
    3. The mic can page virtual memory from the host.
  13. The fastest machine on top500.org a few years ago used Xeon Phi cards.

    The 2nd used Nvidia K20 cards, and the 3rd fastest was an IBM Blue Gene.

    So, my course lets you use the 2 fastest architectures, and there's another course available at RPI for the 3rd.

  14. Information:

    1. https://en.wikipedia.org/wiki/Xeon_Phi
    2. http://ark.intel.com/products/80555/Intel-Xeon-Phi-Coprocessor-7120A-16GB-1_238-GHz-61-core
    3. http://www.intel.com/content/www/us/en/products/processors/xeon-phi/xeon-phi-processors.html
    4. http://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html
    5. https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss
    6. https://pleiades.ucsc.edu/hyades/MIC_QuickStart_Guide
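
  15. As a sketch of the offload style mentioned in point 12: the following untested C program would offload an OpenMP loop from the host to the mic. It assumes Intel's icc with MIC offload support plus MPSS (not currently set up for this on parallel.ecse); the file name and compile line (something like icc -qopenmp square.c) are my guesses.

    /* square.c -- hypothetical offload sketch, untested */
    #include <stdio.h>
    #define N 10000

    int main(void) {
        float a[N], b[N];                     /* small enough to live on the stack */
        for (int i = 0; i < N; i++) a[i] = i;

        /* Copy a to the card, square it there using the mic's many threads,
           then copy b back to the host. */
        #pragma offload target(mic:0) in(a) out(b)
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                b[i] = a[i] * a[i];
        }

        printf("b[10] = %f\n", b[10]);
        return 0;
    }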

3.3   parallel.ecse's mic

  1. The hostname (of this particular MIC) is parallel-mic0 or mic0.

  2. The local filesystem is in RAM and is reinitialized when mic0 is rebooted.

  3. Parallel:/home and /parallel-class are NFS exported to mic0.

  4. /home can be used to move files back and forth.

  5. All the user accounts on parallel were given accounts on mic0.

  6. You can ssh to mic0 from parallel.

  7. Your current parallel ssh key pair should work.

  8. Your parallel login password as of a few days ago should work on mic0.

    However, future changes to your parallel password will not propagate to mic0 and you cannot change your mic0 password.

    (The mic0 setup snapshotted parallel's accounts and created a read-only image to boot mic0 from. Any changes to mic0:/etc/shadow are reverted when mic0 reboots.)

    So use your public key.

3.4   Programming the mic

  1. Parallel:/parallel-class/mic/bin has versions of gcc, g++, etc., with names like k1om-mpss-linux-g++.

  2. These run on parallel and produce executable files that run (only) on mic0.

  3. Here's an example of compiling (on parallel) a C program in /parallel-class/mic:

    bin/k1om-mpss-linux-gcc hello.c -o hello-mic
    
  4. Run it thus from parallel (it runs on mic0):

    ssh mic0  /parallel-class/mic/hello-mic
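
  5. For reference, nothing mic-specific is needed in the source; any plain C program cross-compiles the same way. Here's a guess at what hello.c might contain (the actual file may differ):

    /* hello.c -- a minimal program to cross-compile for mic0 */
    #include <stdio.h>

    int main(void) {
        printf("Hello from the Xeon Phi!\n");
        return 0;
    }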
    

4   IBM Blue Gene

  1. IBM has also ended the Blue Gene product line.
  2. They're currently pushing an OpenPOWER POWER8 host plus Nvidia (or other) devices.
  3. E.g., it tightly couples a POWER8 to several Nvidia Tesla GPUs.

5   Cloud computing

(Enrichment; read on your own)

The material is from Wikipedia, which appeared to be better than any other source that I could find.

  1. Hierarchy:

    1. IaaS (Infrastructure as a Service)
      1. Sample functionality: VM, storage
      2. Examples:
        1. Google_Compute_Engine
        2. Amazon_Web_Services
        3. OpenStack : compute, storage, networking, dashboard
    2. PaaS (Platform ...)
      1. Sample functionality: OS, Web server, database server
      2. Examples:
        1. OpenShift
        2. Cloud_Foundry
        3. Hadoop :
          1. distributed FS, Map Reduce
          2. derived from Google FS, map reduce
          3. used by Facebook etc.
      3. Now, people often run Apache Spark instead of Hadoop's MapReduce, because Spark is faster.
    3. SaaS (Software ...)
      1. Sample functionality: email, gaming, CRM, ERP
  2. Cloud_computing_comparison

  3. Virtual machine.

    The big question is, at what level does the virtualization occur? Do you duplicate the whole file system and OS, or even emulate the HW, or just try to isolate files and processes within the same OS?

    1. Virtualization
    2. Hypervisor
    3. Xen
    4. Kernel-based_Virtual_Machine
    5. QEMU
    6. VMware
    7. Containers, e.g., Docker
    8. Comparison_of_platform_virtual_machines
  4. Distributed storage

    1. Virtual_file_system
    2. Lustre_(file_system)
    3. Comparison_of_distributed_file_systems
    4. Hadoop_distributed_file_system
  5. See also

    1. VNC
    2. Grid_computing
      1. decentralized, heterogeneous
      2. used for major projects like protein folding

6   More parallel tools

6.1   cuFFT Notes

  1. GPU Computing with CUDA Lecture 8 - CUDA Libraries - CUFFT, PyCUDA from Christopher Cooper, BU
  2. video #8 - CUDA 5.5 cuFFT FFTW API Support. 3 min.
  3. cuFFT is inspired by FFTW (the Fastest Fourier Transform in the West), whose authors say it is as fast as commercial FFT packages.
  4. I.e., sometimes commercial packages may be worth the money.
  5. Although the FFT is taught for N a power of two, users often want to process other dataset sizes.
  6. The problem is that the optimal recursion method, and the relevant coefficients, depend on the prime factors of N.
  7. FFTW and cuFFT determine a good solution procedure for the particular N.
  8. Since this computation takes time, they store the method in a plan.
  9. You can then apply the plan to many datasets.
  10. If you're going to be processing very many datasets, you can tell FFTW or cuFFT to perform sample timing experiments on your system, to help in devising the best plan.
  11. That's a nice strategy that some other numerical SW uses.
  12. One example is Automatically Tuned Linear Algebra Software (ATLAS).
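  13. Here's a minimal, untested sketch of that plan-then-execute pattern with cuFFT. The input here is just zeros; a real program would fill d_data (e.g., with cudaMemcpy) and copy the result back. Link with -lcufft.

    // Illustrative cuFFT sketch: make a plan once, then apply it.
    #include <cuda_runtime.h>
    #include <cufft.h>

    int main(void) {
        const int N = 1000;                    // transform length; need not be a power of 2
        cufftComplex *d_data;
        cudaMalloc((void **)&d_data, N * sizeof(cufftComplex));
        cudaMemset(d_data, 0, N * sizeof(cufftComplex));     // dummy input

        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, 1);   // cuFFT picks a good method for this N

        // The same plan can now be applied to many datasets of this size.
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);    // in-place forward FFT
        cudaDeviceSynchronize();

        cufftDestroy(plan);
        cudaFree(d_data);
        return 0;
    }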

6.2   cuBLAS etc Notes

  1. BLAS is an API for a set of simple matrix and vector functions, such as multiplying a vector by a matrix.
  2. These functions' efficiency is important since they are the basis for widely used numerical applications.
  3. Indeed you usually don't call BLAS functions directly, but use higher-level packages like LAPACK that use BLAS.
  4. There are many implementations, free and commercial, of BLAS.
  5. cuBLAS is one.
  6. One reason that Fortran is still used is that, in the past, it was easier to write efficient Fortran programs than C or C++ programs for these applications.
  7. There are other, very efficient, C++ numerical packages. (I can list some, if there's interest).
  8. Their efficiency often comes from aggressively using C++ templates.
  9. SC12 Demo: Using CUDA Library to accelerate applications 7:11 https://www.youtube.com/watch?v=P2Ew4Ljyi6Y
  10. Matrix mult example
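  11. For instance, here's a minimal, untested sketch of cublasSgemm computing C = alpha*A*B + beta*C on the GPU (cuBLAS, like BLAS, assumes column-major storage). The host matrices are all zeros just to keep the sketch short. Link with -lcublas.

    // Illustrative cuBLAS SGEMM sketch.
    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <stdlib.h>

    int main(void) {
        const int n = 512;                         // square matrices for simplicity
        const float alpha = 1.0f, beta = 0.0f;
        size_t bytes = (size_t)n * n * sizeof(float);

        float *h_A = (float *)calloc((size_t)n * n, sizeof(float));  // host inputs (zeros)
        float *h_B = (float *)calloc((size_t)n * n, sizeof(float));
        float *d_A, *d_B, *d_C;
        cudaMalloc((void **)&d_A, bytes);
        cudaMalloc((void **)&d_B, bytes);
        cudaMalloc((void **)&d_C, bytes);
        cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        // No transposes; leading dimensions are n for square column-major matrices.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);
        cudaDeviceSynchronize();

        cublasDestroy(handle);
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
        free(h_A); free(h_B);
        return 0;
    }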

6.3   Matlab

  1. Good for applications that look like matrices.

    Considerable contortions are required for, e.g., a general graph, which you'd represent with a large sparse adjacency matrix.

  2. Using explicit for loops is slow.

  3. Execution is efficient when using builtin matrix functions,

    but it can be difficult to write your algorithm that way,

    and the resulting code can be difficult to read.

  4. Very expensive and getting more so.

    Many separately priced apps.

  5. Uses state-of-the-art numerical algorithms.

    E.g., to solve large sparse overdetermined linear systems.

    Better than Mathematica.

  6. Most or all such algorithms are also freely available as C++ libraries.

    However, which library to use?

    Complicated calling sequences.

    Obscure C++ template error messages.

  7. Graphical output is mediocre.

    Mathematica is better.

  8. Various ways Matlab can execute in parallel

    1. Operations on arrays can execute in parallel.

      E.g. B=SIN(A) where A is a matrix.

    2. Automatic multithreading by some functions

      Various functions, like INV(a), automatically use perhaps 8 cores.

      The '8' is a license limitation.

      Which MATLAB functions benefit from multithreaded computation?

    3. PARFOR

      Like FOR, but multithreaded.

      However, FOR is slow.

      Many restrictions, e.g., cannot be nested.

      Matlab's introduction to parallel solutions

      Start pools first with: MATLABPOOL OPEN 12

      Limited to 12 threads.

      Can do reductions.

    4. Parallel Computing Server

      This runs on a parallel machine, including Amazon EC2.

      Your client sends batch or interactive jobs to it.

      Many Matlab toolboxes are not licensed to use it.

      This makes it much less useful.

    5. GPU computing

      Create an array on device with gpuArray

      Run builtin functions on it.

      See Matlab's documentation on running built-in functions on a GPU.

  9. Parallel and GPU Computing Tutorials, Part 9: GPU Computing with MATLAB 6:19 https://www.youtube.com/watch?v=1WTIHzwJ4j0

6.4   Mathematica in parallel

You terminate an input command with shift-enter.

Some Mathematica commands:

Sin[1.]
Plot[Sin[x],{x,-2,2}]
a=Import[
 "/opt/parallel/mathematica/mtn1.dat"]
Information[a]
Length[a]
b=ArrayReshape[a,{400,400}]
MatrixPlot[b]
ReliefPlot[b]
ReliefPlot[b,Method->"AspectBasedShading"]
ReliefPlot[MedianFilter[b,1]]
Dimensions[b]
Eigenvalues[b]   (* when you get bored waiting, type alt-. *)
Eigenvalues[b+0.0]
Table[ {x^i y^j,x^j y^i},{i,2},{j,2}]
Flatten[Table[ {x^i y^j,x^j y^i},{i,2},{j,2}],1]
StreamPlot[{x*y,x+y},{x,-3,3},{y,-3,3}]
$ProcessorCount
$ProcessorType
(* select "Parallel Kernel Configuration" and "Status" in the Evaluation menu *)
ParallelEvaluate[$ProcessID]
PrimeQ[101]
Parallelize[Table[PrimeQ[n!+1],{n,400,500}]]
merQ[n_]:=PrimeQ[2^n-1]
Select[Range[5000],merQ]
ParallelSum[Sin[x+0.],{x,0,100000000}]
Parallelize[  Select[Range[5000],merQ]]
Needs["CUDALink`"]  *note the back quote*
CUDAInformation[]
Manipulate[n, {n, 1.1, 20.}]
Plot[Sin[x], {x, 1., 20.}]
Manipulate[Plot[Sin[x], {x, 1., n}], {n, 1.1, 20.}]
Integrate[Sin[x]^3, x]
Manipulate[Integrate[Sin[x]^n, x], {n, 0, 20}]
Manipulate[{n, FactorInteger[n]}, {n, 1, 100, 1}]
Manipulate[Plot[Sin[a x] + Sin[b x], {x, 0, 10}],
    {a, 1, 4}, {b, 1, 4}]

Unfortunately, there's a problem with the Mathematica-CUDA interface that I'm still debugging.

Applications of GPU Computation in Mathematica (42:18) https://www.youtube.com/watch?v=LZAZ4ddUMKU