PAR Class 12, Wed 2018-04-18

1   Git

Git is good for keeping several versions of a project simultaneously. A brief git intro:

Create a dir for the project:

mkdir PROJECT; cd PROJECT

Initialize:

git init

Create a branch (you can do this several times):

git branch MYBRANCHNAME

Go to a branch:

git checkout MYBRANCHNAME

Do things:

vi, make, ....

Save it:

git add .; git commit -m "COMMENT"

Repeat

I might use this to modify a program for class.

2   Software tips

2.1   Freeze decisions early: SW design paradigm

One of my rules is to freeze design decisions so that they take effect as early in the process execution as possible. Constructing variables at compile time is best, at function call time (i.e., on the stack) is 2nd, and on the heap is worst.

  1. If I have to construct variables on the heap, I construct few and large variables, never many small ones.

  2. Often I compile the max dataset size into the program, which permits constructing the arrays at compile time. Recompiling for a larger dataset is quick (unless you're using CUDA).

    Accessing this type of variable uses one less level of pointer indirection than accessing a variable on the heap. I don't know whether this is faster with a good optimizing compiler, but it's probably not slower.

  3. If the data will require a dataset with unpredictably sized components, such as a ragged array, then I may do the following (see the sketch after this list).

    1. Read the data once to accumulate the necessary statistics.
    2. Construct the required ragged array.
    3. Reread the data and populate the array.
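
Here's a minimal C++ sketch of points 2 and 3 above. The file name, the "row value" input format, and MAXROWS are made up for illustration, and error checking is mostly omitted; it's a sketch of the idea, not the code I use in class.

#include <cstdio>
#include <vector>

// Decision frozen at compile time: the max number of rows is baked in.
// Recompiling with a larger MAXROWS is quick if the data outgrows it.
constexpr int MAXROWS = 1000000;
static int rowlen[MAXROWS];      // one large static array, not many small heap objects

int main() {
  // Pass 1: read the data once, only to accumulate statistics (row count, row lengths).
  FILE *f = fopen("data.txt", "r");   // assumed input: whitespace-separated "row value" pairs, 0 <= row < MAXROWS
  if (!f) return 1;
  int nrows = 0, row, value;
  while (fscanf(f, "%d %d", &row, &value) == 2) {
    ++rowlen[row];
    if (row + 1 > nrows) nrows = row + 1;
  }

  // Construct the required ragged array: one big block plus per-row offsets,
  // instead of a separate small heap allocation per row.
  std::vector<int> start(nrows + 1, 0);
  for (int r = 0; r < nrows; ++r) start[r + 1] = start[r] + rowlen[r];
  std::vector<int> data(start[nrows]);

  // Pass 2: reread the data and populate the array.
  rewind(f);
  std::vector<int> fillpos(start.begin(), start.end() - 1);   // next free slot in each row
  while (fscanf(f, "%d %d", &row, &value) == 2)
    data[fillpos[row]++] = value;
  fclose(f);

  printf("%d rows, %d values total\n", nrows, start[nrows]);
  return 0;
}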

2.2   Faster graphical access to parallel.ecse

X over ssh is very slow.

Here are some things I've discovered that sometimes help.

  1. Use xpra; here's an example:

    1. On parallel.ecse:

      xpra start :77; DISPLAY=:77 xeyes&
      

      Don't everyone use 77; pick your own number in the range 20-99.

    2. On the client, i.e., your machine:

      xpra attach ssh:parallel.ecse.rpi.edu:77
      
    3. I suspect that security is weak: when you start an xpra session, anyone on parallel.ecse can probably display to it, anyone with ssh access to parallel.ecse can probably try to attach to it, and the first person to attach wins.

  2. Use nx, which needs a server, e.g., FreeNX.

3   Jack Dongarra videos

  1. Sunway TaihuLight's strengths and weaknesses highlighted. 9 min. 8/21/2016.

    This is the new fastest known machine on the Top500 list. A machine with many Intel Xeon Phi coprocessors is now 2nd, one with Nvidia K20 GPUs is 3rd, and some machine built by a company down the river is 4th. These last three machines have been at the top for a surprisingly long time.

  2. An Overview of High Performance Computing and Challenges for the Future. 57 min, 11/16/2016.

4   More parallel tools

4.1   cuFFT Notes

  1. GPU Computing with CUDA Lecture 8 - CUDA Libraries - CUFFT, PyCUDA from Christopher Cooper, BU
  2. video #8 - CUDA 5.5 cuFFT FFTW API Support. 3 min.
  3. cuFFT is inspired by FFTW (the Fastest Fourier Transform in the West), which its authors say is as fast as commercial FFT packages.
  4. I.e., sometimes commercial packages may be worth the money.
  5. Although the FFT is taught for N a power of two, users often want to process other dataset sizes.
  6. The problem is that the optimal recursion method, and the relevant coefficients, depend on the prime factors of N.
  7. FFTW and cuFFT determine a good solution procedure for the particular N.
  8. Since this computation takes time, they store the method in a plan.
  9. You can then apply the plan to many datasets (see the sketch after this list).
  10. If you're going to be processing very many datasets, you can tell FFTW or cuFFT to perform sample timing experiments on your system, to help in devising the best plan.
  11. That's a nice strategy that some other numerical SW uses.
  12. One example is Automatically Tuned Linear Algebra Software (ATLAS).
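
Here's a minimal host-side C++ sketch of the plan-then-execute pattern; the sizes and the synthetic data are made up for illustration, and error checking is omitted. Compile with something like nvcc -lcufft.

#include <cufft.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main() {
  const int N = 1000;                    // deliberately not a power of 2
  cufftHandle plan;
  cufftPlan1d(&plan, N, CUFFT_C2C, 1);   // the expensive planning step, done once

  cufftComplex *d_data;
  cudaMalloc((void**)&d_data, N * sizeof(cufftComplex));

  std::vector<cufftComplex> h_data(N);
  for (int dataset = 0; dataset < 100; ++dataset) {       // many datasets, one plan
    for (int i = 0; i < N; ++i) { h_data[i].x = i % (dataset + 1); h_data[i].y = 0.f; }
    cudaMemcpy(d_data, h_data.data(), N * sizeof(cufftComplex), cudaMemcpyHostToDevice);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);     // in-place forward transform
    cudaMemcpy(h_data.data(), d_data, N * sizeof(cufftComplex), cudaMemcpyDeviceToHost);
  }
  printf("DC bin of last dataset: (%f, %f)\n", h_data[0].x, h_data[0].y);

  cufftDestroy(plan);
  cudaFree(d_data);
  return 0;
}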

4.2   cuBLAS etc Notes

  1. BLAS is an API for a set of simple matrix and vector functions, such as multiplying a vector by a matrix.
  2. These functions' efficiency is important since they are the basis for widely used numerical applications.
  3. Indeed you usually don't call BLAS functions directly, but use higher-level packages like LAPACK that use BLAS.
  4. There are many implementations, free and commercial, of BLAS.
  5. cuBLAS is one.
  6. One reason that Fortran is still used is that, in the past, it was easier to write efficient Fortran programs than C or C++ programs for these applications.
  7. There are other, very efficient, C++ numerical packages. (I can list some, if there's interest).
  8. Their efficiency often comes from aggressively using C++ templates.
  9. Matrix mult example (a sketch follows below).
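
To make item 9 concrete, here's a minimal cuBLAS sketch: C = A*B for square N x N single-precision matrices, stored column-major as BLAS expects. The sizes and values are made up for illustration and error checking is omitted. Compile with something like nvcc -lcublas.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main() {
  const int N = 512;
  std::vector<float> hA(N * N, 1.0f), hB(N * N, 2.0f), hC(N * N);

  float *dA, *dB, *dC;
  cudaMalloc((void**)&dA, N * N * sizeof(float));
  cudaMalloc((void**)&dB, N * N * sizeof(float));
  cudaMalloc((void**)&dC, N * N * sizeof(float));
  cudaMemcpy(dA, hA.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dB, hB.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);

  cublasHandle_t handle;
  cublasCreate(&handle);
  const float alpha = 1.0f, beta = 0.0f;
  // SGEMM, a level-3 BLAS routine: C = alpha*A*B + beta*C.
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
              &alpha, dA, N, dB, N, &beta, dC, N);

  cudaMemcpy(hC.data(), dC, N * N * sizeof(float), cudaMemcpyDeviceToHost);
  printf("C[0] = %f (expect %f)\n", hC[0], 2.0f * N);   // each entry is a sum of N products 1*2

  cublasDestroy(handle);
  cudaFree(dA); cudaFree(dB); cudaFree(dC);
  return 0;
}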

4.3   Matlab

  1. Good for applications that look like matrices.

    Considerable contortions required for, e.g., a general graph. You'd represent that with a large sparse adjacency matrix.

  2. Using explicit for loops is slow.

  3. Efficient execution when using builtin matrix functions,

    but it can be difficult to write your algorithm that way, and the resulting code can be hard to read.

  4. Very expensive and getting more so.

    Many separately priced apps.

  5. Uses state-of-the-art numerical algorithms.

    E.g., to solve large sparse overdetermined linear systems.

    Better than Mathematica.

  6. Most or all such algorithms are also freely available as C++ libraries.

    However, which library to use?

    Complicated calling sequences.

    Obscure C++ template error messages.

  7. Graphical output is mediocre.

    Mathematica is better.

  8. Various ways Matlab can execute in parallel

    1. Operations on arrays can execute in parallel.

      E.g., B = sin(A) where A is a matrix.

    2. Automatic multithreading by some functions

      Various functions, like inv(A), automatically use perhaps 8 cores.

      The '8' is a license limitation.

      Which MATLAB functions benefit from multithreaded computation?

    3. PARFOR

      Like FOR, but multithreaded.

      However, FOR is slow.

      Many restrictions, e.g., cannot be nested.

      Matlab's introduction to parallel solutions

      Start pools first with: matlabpool open 12

      Limited to 12 threads.

      Can do reductions.

    4. Parallel Computing Server

      This runs on a parallel machine, including Amazon EC2.

      Your client sends batch or interactive jobs to it.

      Many Matlab toolboxes are not licensed to use it.

      This makes it much less useful.

    5. GPU computing

      Create an array on device with gpuArray

      Run builtin functions on it.

      See Matlab's docs on running built-in functions on a GPU.

4.4   Mathematica in parallel

You terminate an input command with shift-enter.

Some Mathematica commands:

Sin[1.]
Plot[Sin[x],{x,-2,2}]
a=Import[
 "/opt/parallel/mathematica/mtn1.dat"]
Information[a]
Length[a]
b=ArrayReshape[a,{400,400}]
MatrixPlot[b]
ReliefPlot[b]
ReliefPlot[b,Method->"AspectBasedShading"]
ReliefPlot[MedianFilter[b,1]]
Dimensions[b]
Eigenvalues[b]   (* when you get bored waiting, type alt-. to abort *)
Eigenvalues[b+0.0]
Table[ {x^i y^j,x^j y^i},{i,2},{j,2}]
Flatten[Table[ {x^i y^j,x^j y^i},{i,2},{j,2}],1]
StreamPlot[{x*y,x+y},{x,-3,3},{y,-3,3}]
$ProcessorCount
$ProcessorType
(* select "Parallel Kernel Configuration" and "Status" in the "Evaluation" menu *)
ParallelEvaluate[$ProcessID]
PrimeQ[101]
Parallelize[Table[PrimeQ[n!+1],{n,400,500}]]
merQ[n_]:=PrimeQ[2^n-1]
Select[Range[5000],merQ]
ParallelSum[Sin[x+0.],{x,0,100000000}]
Parallelize[  Select[Range[5000],merQ]]
Needs["CUDALink`"]  *note the back quote*
CUDAInformation[]
Manipulate[n, {n, 1.1, 20.}]
Plot[Sin[x], {x, 1., 20.}]
Manipulate[Plot[Sin[x], {x, 1., n}], {n, 1.1, 20.}]
Integrate[Sin[x]^3, x]
Manipulate[Integrate[Sin[x]^n, x], {n, 0, 20}]
Manipulate[{n, FactorInteger[n]}, {n, 1, 100, 1}]
Manipulate[Plot[Sin[a x] + Sin[b x], {x, 0, 10}],
    {a, 1, 4}, {b, 1, 4}]

Unfortunately, there's a problem with the Mathematica-CUDA interface that I'm still debugging.

5   Cloud computing

The material is from Wikipedia, which appeared better than any other source that I could find.

  1. Hierarchy:

    1. IaaS (Infrastructure as a Service)
      1. Sample functionality: VM, storage
      2. Examples:
        1. Google_Compute_Engine
        2. Amazon_Web_Services
        3. OpenStack : compute, storage, networking, dashboard
    2. PaaS (Platform ...)
      1. Sample functionality: OS, Web server, database server
      2. Examples:
        1. OpenShift
        2. Cloud_Foundry
        3. Hadoop :
          1. distributed FS, Map Reduce
          2. derived from Google FS, map reduce
          3. used by Facebook etc.
      3. Now, people often run Apache Spark instead of Hadoop's MapReduce, because Spark is faster.
    3. SaaS (Software ...)
      1. Sample functionality: email, gaming, CRM, ERP
  2. Cloud_computing_comparison

  3. Virtual machine.

    The big question is, at what level does the virtualization occur? Do you duplicate the whole file system and OS, even emulate the HW, or just try to isolate files and processes within the same OS?

    1. Virtualization
    2. Hypervisor
    3. Xen
    4. Kernel-based_Virtual_Machine
    5. QEMU
    6. VMware
    7. Containers, e.g., Docker
    8. Comparison_of_platform_virtual_machines
  4. Distributed storage

    1. Virtual_file_system
    2. Lustre_(file_system)
    3. Comparison_of_distributed_file_systems
    4. Hadoop_distributed_file_system
  5. See also

    1. VNC
    2. Grid_computing
      1. decentralized, heterogeneous
      2. used for major projects like protein folding