PAR Class 12, Wed 20180418
Table of contents
1 Git
Git is good for keeping several versions simultaneously. A git intro:
Create a dir for the project:
mkdir PROJECT; cd PROJECT
Initialize:
git init
Create a branch (you can do this several times):
git branch MYBRANCHNAME
Go to a branch:
git checkout MYBRANCHNAME
Do things:
vi, make, ....
Save it:
git add .; git commit -m "COMMENT"
Repeat
I might use this to modify a program for class.
2 Software tips
2.1 Freeze decisions early: SW design paradigm
One of my rules is to push design decisions to take effect as early in the process execution as possible. Constructing variables at compile time is best, at function call time is 2nd, and on the heap is worst.

If I have to construct variables on the heap, I construct few and large variables, never many small ones.

Often I compile the max dataset size into the program, which permits constructing the arrays at compile time. Recompiling for a larger dataset is quick (unless you're using CUDA).
Accessing this type of variable uses one less level of pointer than accessing a variable on the heap. I don't know whether this is faster with a good optimizing compiler, but it's probably not slower.

If the dataset has unpredictably sized components, such as a ragged array, then I may do the following.
 Read the data once to accumulate the necessary statistics.
 Construct the required ragged array.
 Reread the data and populate the array.
2.2 Faster graphical access to parallel.ecse
X over ssh is very slow.
Here are some things I've discovered that help, and that work sometimes.

Use xpra; here's an example:

On parallel.ecse:
xpra start :77; DISPLAY=:77 xeyes&
Don't everyone use 77; pick your own numbers in the range 20-99.

On the X server, i.e., your machine:
xpra attach ssh:parallel.ecse.rpi.edu:77

I suspect that security is weak. When you start an xpra session, I suspect that anyone on parallel.ecse can display to it, that anyone with ssh access to parallel.ecse can try to attach to it, and that the 1st person wins.


Use nx, which needs a server, e.g., FreeNX.
3 Jack Dongarra videos

Sunway TaihuLight's strengths and weaknesses highlighted. 9 min. 8/21/2016.
This is the new fastest known machine on top500. A machine with many Intel Xeon Phi coprocessors is now 2nd, Nvidia K20 is 3rd, and some machine built by a company down the river is 4th. These last 3 machines have been at the top for a surprisingly long time.

An Overview of High Performance Computing and Challenges for the Future. 57min, 11/16/2016.
4 More parallel tools
4.1 cuFFT Notes
 GPU Computing with CUDA, Lecture 8 - CUDA Libraries - CUFFT, PyCUDA, from Christopher Cooper, BU
 video #8 - CUDA 5.5 cuFFT FFTW API Support. 3 min.
 cuFFT is inspired by FFTW (the fastest Fourier transform in the west), whose authors say it's as fast as commercial FFT packages.
 I.e., sometimes commercial packages may be worth the money.
 Although the FFT is taught for N a power of two, users often want to process other dataset sizes.
 The problem is that the optimal recursion method, and the relevant coefficients, depend on the prime factors of N.
 FFTW and cuFFT determine a good solution procedure for the particular N.
 Since this computation takes time, they store the method in a plan.
 You can then apply the plan to many datasets.
 If you're going to be processing very many datasets, you can tell FFTW or cuFFT to perform sample timing experiments on your system, to help in devising the best plan.
 That's a nice strategy that some other numerical SW uses.
 One example is Automatically Tuned Linear Algebra Software (ATLAS).
4.2 cuBLAS etc Notes
 BLAS is an API for a set of simple matrix and vector functions, such as multiplying a vector by a matrix.
 These functions' efficiency is important since they are the basis for widely used numerical applications.
 Indeed you usually don't call BLAS functions directly, but use higher-level packages like LAPACK that use BLAS.
 There are many implementations, free and commercial, of BLAS.
 cuBLAS is one.
 One reason that Fortran is still used is that, in the past, it was easier to write efficient Fortran programs than C or C++ programs for these applications.
 There are other, very efficient, C++ numerical packages. (I can list some, if there's interest).
 Their efficiency often comes from aggressively using C++ templates.
 Matrix mult example
4.3 Matlab

Good for applications that look like matrices.
Considerable contortions required for, e.g., a general graph. You'd represent that with a large sparse adjacency matrix.

Using explicit for loops is slow.

Efficient execution when using builtin matrix functions,
but can be difficult to write your algorithm that way, and
difficult to read the code.

Very expensive and getting more so.
Many separately priced apps.

Uses state-of-the-art numerical algorithms.
E.g., to solve large sparse overdetermined linear systems.
Better than Mathematica.

Most or all such algorithms also freely available as C++ libraries.
However, which library to use?
Complicated calling sequences.
Obscure C++ template error messages.

Graphical output is mediocre.
Mathematica is better.

Various ways Matlab can execute in parallel

Operations on arrays can execute in parallel.
E.g. B=SIN(A) where A is a matrix.

Automatic multithreading by some functions
Various functions, like INV(a), automatically use perhaps 8 cores.
The '8' is a license limitation.
Which MATLAB functions benefit from multithreaded computation?

PARFOR
Like FOR, but multithreaded.
However, FOR is slow.
Many restrictions, e.g., cannot be nested.
Matlab's introduction to parallel solutions
Start pools first with: MATLABPOOL OPEN 12
Limited to 12 threads.
Can do reductions.

Parallel Computing Server
This runs on a parallel machine, including Amazon EC2.
Your client sends batch or interactive jobs to it.
Many Matlab toolboxes are not licensed to use it.
This makes it much less useful.

GPU computing
Create an array on device with gpuArray
Run builtin functions on it.
Matlab's Run built-in functions on a GPU

4.4 Mathematica in parallel
You terminate an input command with shift-enter.
Some Mathematica commands:
Sin[1.]
Plot[Sin[x],{x,-2,2}]
a=Import["/opt/parallel/mathematica/mtn1.dat"]
Information[a]
Length[a]
b=ArrayReshape[a,{400,400}]
MatrixPlot[b]
ReliefPlot[b]
ReliefPlot[b,Method->"AspectBasedShading"]
ReliefPlot[MedianFilter[b,1]]
Dimensions[b]
Eigenvalues[b]   *When you get bored waiting, type* alt-.
Eigenvalues[b+0.0]
Table[{x^i y^j,x^j y^i},{i,2},{j,2}]
Flatten[Table[{x^i y^j,x^j y^i},{i,2},{j,2}],1]
StreamPlot[{x*y,x+y},{x,-3,3},{y,-3,3}]
$ProcessorCount
$ProcessorType
*Select "Parallel Kernel Configuration" and "Status" in the "Evaluation" menu*
ParallelEvaluate[$ProcessID]
PrimeQ[101]
Parallelize[Table[PrimeQ[n!+1],{n,400,500}]]
merQ[n_]:=PrimeQ[2^n-1]
Select[Range[5000],merQ]
ParallelSum[Sin[x+0.],{x,0,100000000}]
Parallelize[Select[Range[5000],merQ]]
Needs["CUDALink`"]   *note the back quote*
CUDAInformation[]
Manipulate[n, {n, 1.1, 20.}]
Plot[Sin[x], {x, 1., 20.}]
Manipulate[Plot[Sin[x], {x, 1., n}], {n, 1.1, 20.}]
Integrate[Sin[x]^3, x]
Manipulate[Integrate[Sin[x]^n, x], {n, 0, 20}]
Manipulate[{n, FactorInteger[n]}, {n, 1, 100, 1}]
Manipulate[Plot[Sin[a x] + Sin[b x], {x, 0, 10}], {a, 1, 4}, {b, 1, 4}]
Unfortunately there's a problem that I'm still debugging with the Mathematica-CUDA interface.
6 Cloud computing
The material is from Wikipedia, which was better than any other source I could find.

Hierarchy:

IaaS (Infrastructure as a Service)
 Sample functionality: VM, storage
 Examples:
 Google_Compute_Engine
 Amazon_Web_Services
 OpenStack : compute, storage, networking, dashboard

PaaS (Platform ...)
 Sample functionality: OS, Web server, database server
 Examples:
 OpenShift
 Cloud_Foundry

Hadoop :
 distributed FS, Map Reduce
 derived from Google FS, map reduce
 used by Facebook etc.
 Now, people often run Apache Spark (Lightning-Fast Cluster Computing) instead of Hadoop, because Spark is faster.

SaaS (Software ...)
 Sample functionality: email, gaming, CRM, ERP

IaaS (Infrastructure as a Service)

Virtual machine.
The big question is: at what level does the virtualization occur? Do you duplicate the whole file system and OS, even emulate the HW, or just try to isolate files and processes in the same OS?
 Virtualization
 Hypervisor
 Xen
 Kernel-based_Virtual_Machine
 QEMU
 VMware
 Containers, e.g., Docker
 Comparison_of_platform_virtual_machines

Distributed storage

See also
 VNC

Grid_computing
 decentralized, heterogeneous
 used for major projects like protein folding