.. title: PAR Class 12, Wed 2018-04-18
.. slug: class12
.. date: 2018-04-18
.. tags: class
.. category:
.. link:
.. description:
.. type: text
.. raw:: html
.. role:: red
.. role:: blue
.. sectnum::
.. contents:: Table of contents
Git
--------------------------------------
Git is good for simultaneously keeping several versions of a project.
A quick git intro:

#. Create a dir for the project::

       mkdir PROJECT; cd PROJECT

#. Initialize::

       git init

#. Create a branch (you can do this several times)::

       git branch MYBRANCHNAME

#. Go to a branch::

       git checkout MYBRANCHNAME

#. Do things::

       vi, make, ....

#. Save it::

       git add .; git commit -m COMMENT

#. Repeat.

I might use this to modify a program for class.
Software tips
--------------------------------------
Freeze decisions early: SW design paradigm
===========================================
One of my rules is to make design decisions take effect as early in
the process's execution as possible.  Constructing variables at
compile time is best, at function-call time (on the stack) is second
best, and on the heap is worst.
a. If I have to construct variables on the heap, I construct few large
   variables, never many small ones.
#. Often I compile the max dataset size into the program, which permits
   constructing the arrays at compile time.  Recompiling for a larger
   dataset is quick (unless you're using CUDA).

   Accessing this type of variable uses one less level of pointer than
   accessing a variable on the heap.  I don't know whether this is
   faster with a good optimizing compiler, but it's probably not slower.
#. If the dataset has unpredictably sized components, such as a ragged
   array, then I may do the following.

   i. Read the data once to accumulate the necessary statistics.
   #. Construct the required ragged array.
   #. Reread the data and populate the array.
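A minimal sketch of that two-pass idea in Python (the file format here, one whitespace-separated row of numbers per line, is my assumption):

```python
from array import array

def load_ragged(filename):
    """Two-pass load of a ragged array of floats.

    Pass 1 gathers the statistics (row lengths); then we allocate
    ONE flat block plus an offset table, rather than many small
    per-row allocations.  Pass 2 rereads the file and fills the block.
    """
    # Pass 1: accumulate statistics.
    lengths = []
    with open(filename) as f:
        for line in f:
            lengths.append(len(line.split()))

    # Construct the required ragged array: one big block + offsets.
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)
    data = array('d', bytes(8 * offsets[-1]))  # one large allocation of zeros

    # Pass 2: reread the data and populate the array.
    with open(filename) as f:
        for row, line in enumerate(f):
            for col, tok in enumerate(line.split()):
                data[offsets[row] + col] = float(tok)
    return data, offsets

def get_row(data, offsets, i):
    """Return row i as a list."""
    return data[offsets[i]:offsets[i + 1]].tolist()
```

The point is the allocation pattern: one large block plus an offset table, instead of one small heap allocation per row.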
Faster graphical access to parallel.ecse
===========================================
X over ssh is very slow.
Here are some things I've discovered that sometimes help.
#. Use **xpra**; here's an example:

   a. On parallel.ecse::

          xpra start :77; DISPLAY=:77 xeyes&

      Don't all use 77; pick your own number in the range 20-99.
   #. On the server, i.e., your machine::

          xpra attach ssh:parallel.ecse.rpi.edu:77
   #. I suspect that security is weak.  When you start an xpra
      session, I suspect that anyone on parallel.ecse can display to
      it, that anyone with ssh access to parallel.ecse can try to
      attach to it, and that the 1st person wins.
#. Use **nx**, which needs a server, e.g., `FreeNX `_.
Jack Dongarra videos
--------------------------------------
#. `Sunway TaihuLight's strengths and weaknesses highlighted `_. 9 min. 8/21/2016.
This is the new fastest known machine on the Top500 list. A machine with many Intel Xeon Phi coprocessors is now 2nd, an Nvidia K20-based machine is 3rd, and a machine built by a company down the river is 4th. These last 3 machines have been at the top for a surprisingly long time.
#. `An Overview of High Performance Computing and Challenges for the Future `_. 57min, 11/16/2016.
More parallel tools
--------------------------------------
cuFFT Notes
===========================================
#. `GPU Computing with CUDA `_, Lecture 8 - CUDA Libraries - CUFFT,
   PyCUDA, from Christopher Cooper, BU.
#. `video `_ #8 - CUDA 5.5 cuFFT FFTW API Support. 3 min.
#. cuFFT is inspired by FFTW (the Fastest Fourier Transform in the
   West), which its authors say is as fast as commercial FFT packages.
#. I.e., sometimes commercial packages may be worth the money.
#. Although the FFT is taught for N a power of two, users often want
   to process other dataset sizes.
#. The problem is that the optimal recursion method, and the relevant
   coefficients, depend on the prime factors of N.
#. FFTW and cuFFT determine a good solution procedure for the
   particular N.
#. Since this computation takes time, they store the method in a
   **plan**.
#. You can then apply the plan to many datasets.
#. If you're going to be processing very many datasets, you can tell
   FFTW or cuFFT to perform sample timing experiments on your system,
   to help in devising the best plan.
#. That's a nice strategy that some other numerical SW uses.
#. One example is `Automatically Tuned Linear Algebra Software (ATLAS) `_.
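The **plan** idea can be illustrated in a few lines of Python. This toy is a naive O(N²) DFT, not the real cuFFT or FFTW API; it only shows the precompute-once, execute-many pattern:

```python
import cmath

class DFTPlan:
    """Toy illustration of FFTW/cuFFT's "plan" idea: do the
    N-dependent setup (here, the twiddle factors) once, then
    apply it to many datasets of that size."""
    def __init__(self, n):
        self.n = n
        # The expensive, size-dependent setup, done once per N:
        # precompute e^(-2*pi*i*j*k/n) for all j, k.
        self.twiddle = [[cmath.exp(-2j * cmath.pi * j * k / n)
                         for k in range(n)] for j in range(n)]

    def execute(self, x):
        """Apply the plan to one dataset of length n."""
        assert len(x) == self.n
        return [sum(self.twiddle[j][k] * x[k] for k in range(self.n))
                for j in range(self.n)]

plan = DFTPlan(4)                # plan once ...
y = plan.execute([1, 0, 0, 0])   # ... execute many times
```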
cuBLAS etc Notes
===========================================
#. `BLAS `_ is an API for a set of simple matrix and vector
   functions, such as multiplying a vector by a matrix.
#. These functions' efficiency is important since they are the basis
   for widely used numerical applications.
#. Indeed, you usually don't call BLAS functions directly, but use
   higher-level packages like LAPACK that use BLAS.
#. There are many implementations, free and commercial, of BLAS.
#. cuBLAS is one.
#. One reason that Fortran is still used is that, in the past, it was
   easier to write efficient Fortran programs than C or C++ programs
   for these applications.
#. There are other, very efficient, C++ numerical packages.  (I can
   list some, if there's interest.)
#. Their efficiency often comes from aggressively using C++ templates.
#. `Matrix mult example `_
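For concreteness, here is the kind of function BLAS standardizes — the level-2 operation GEMV, y := alpha*A*x + beta*y — written naively in Python; the point of BLAS is that tuned implementations of exactly this interface run far faster than such code:

```python
def gemv(alpha, A, x, beta, y):
    """Sketch of the BLAS level-2 operation y := alpha*A*x + beta*y.
    A is an m-by-n matrix as a list of rows; x and y are lists.
    Real BLAS implementations get their speed from blocking,
    vectorization, and cache tuning, not from code like this."""
    m, n = len(A), len(x)
    return [beta * y[i] + alpha * sum(A[i][j] * x[j] for j in range(n))
            for i in range(m)]
```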
Matlab
===========================================
#. Good for applications that look like matrices.
   Considerable contortions required for, e.g., a general graph.
   You'd represent that with a large sparse adjacency matrix.
#. Using explicit *for* loops is slow.
#. Efficient execution when using builtin matrix functions,
   but it can be difficult to write your algorithm that way, and
   difficult to read the resulting code.
#. Very expensive and getting more so.
   Many separately priced apps.
#. Uses state-of-the-art numerical algorithms,
   e.g., to solve large sparse overdetermined linear systems.
   Better than Mathematica.
#. Most or all such algorithms are also freely available as C++
   libraries.  However, which library should you use?
   Complicated calling sequences.
   Obscure C++ template error messages.
#. Graphical output is mediocre.
   Mathematica is better.
#. Various ways Matlab can execute in parallel:

   a. Operations on arrays can execute in parallel,
      e.g., B=sin(A) where A is a matrix.
   #. Automatic multithreading by some functions:
      various functions, like inv(A), automatically use perhaps 8 cores.
      The '8' is a license limitation.
      `Which MATLAB functions benefit from multithreaded computation? `_
   #. PARFOR:
      like FOR, but multithreaded.
      (However, plain FOR is slow.)
      Many restrictions; e.g., it cannot be nested.
      Matlab's `introduction to parallel solutions `_.
      Start pools first with: MATLABPOOL OPEN 12.
      Limited to 12 threads.
      Can do reductions.
   #. Parallel Computing Server:
      this runs on a parallel machine, including Amazon EC2.
      Your client sends batch or interactive jobs to it.
      Many Matlab toolboxes are not licensed to use it,
      which makes it much less useful.
   #. GPU computing:
      create an array on the device with gpuArray,
      then run builtin functions on it.
      Matlab's `run built in functions on a gpu `_.
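A rough analog of PARFOR with a reduction, sketched in Python's multiprocessing rather than Matlab (the function names are mine):

```python
from multiprocessing import Pool

def is_prime(n):
    """Trial-division primality test (fine for small n)."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def count_primes(lo, hi, workers=4):
    """A PARFOR-style parallel loop plus a reduction:
    count the primes in [lo, hi)."""
    with Pool(workers) as pool:
        flags = pool.map(is_prime, range(lo, hi))  # the parallel loop
    return sum(flags)                              # the reduction
```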
Mathematica in parallel
===========================================
You terminate an input command with *shift-enter*.

Some Mathematica commands::

    Sin[1.]
    Plot[Sin[x],{x,-2,2}]
    a=Import[
     "/opt/parallel/mathematica/mtn1.dat"]
    Information[a]
    Length[a]
    b=ArrayReshape[a,{400,400}]
    MatrixPlot[b]
    ReliefPlot[b]
    ReliefPlot[b,Method->"AspectBasedShading"]
    ReliefPlot[MedianFilter[b,1]]
    Dimensions[b]
    Eigenvalues[b]        (* when you get bored waiting, type alt-. *)
    Eigenvalues[b+0.0]
    Table[ {x^i y^j,x^j y^i},{i,2},{j,2}]
    Flatten[Table[ {x^i y^j,x^j y^i},{i,2},{j,2}],1]
    StreamPlot[{x*y,x+y},{x,-3,3},{y,-3,3}]
    $ProcessorCount
    $ProcessorType
    (* select "Parallel Kernel Configuration" and "Status"
       in the "Evaluation" menu *)
    ParallelEvaluate[$ProcessID]
    PrimeQ[101]
    Parallelize[Table[PrimeQ[n!+1],{n,400,500}]]
    merQ[n_]:=PrimeQ[2^n-1]
    Select[Range[5000],merQ]
    ParallelSum[Sin[x+0.],{x,0,100000000}]
    Parallelize[ Select[Range[5000],merQ]]
    Needs["CUDALink`"]    (* note the back quote *)
    CUDAInformation[]
    Manipulate[n, {n, 1.1, 20.}]
    Plot[Sin[x], {x, 1., 20.}]
    Manipulate[Plot[Sin[x], {x, 1., n}], {n, 1.1, 20.}]
    Integrate[Sin[x]^3, x]
    Manipulate[Integrate[Sin[x]^n, x], {n, 0, 20}]
    Manipulate[{n, FactorInteger[n]}, {n, 1, 100, 1}]
    Manipulate[Plot[Sin[a x] + Sin[b x], {x, 0, 10}],
     {a, 1, 4}, {b, 1, 4}]
Unfortunately there's a problem that I'm still debugging with the
Mathematica - CUDA interface.
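The Parallelize[Select[Range[5000],merQ]] example above has a direct analog in Python with a process pool (merQ reimplemented here by naive trial division, my assumption; suitable only for small n):

```python
from multiprocessing import Pool

def merQ(n):
    """True iff 2^n - 1 is prime (cf. merQ[n_]:=PrimeQ[2^n-1]).
    Naive trial division, so only practical for small n."""
    m = (1 << n) - 1
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

def mersenne_exponents(limit, workers=4):
    """Parallel analog of Parallelize[Select[Range[limit], merQ]]."""
    ns = range(1, limit + 1)
    with Pool(workers) as pool:
        flags = pool.map(merQ, ns)     # test the candidates in parallel
    return [n for n, ok in zip(ns, flags) if ok]
```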
Nvidia videos
--------------
#. `HPC and Supercomputing at GTC 2017 `_ 1 min
#. `NVIDIA Self-Driving Car Demo at CES 2017 `_ 2 min
#. `How Nvidia Went From Gaming to GPUs `_ 3 min
#. `Nvidia Volta To Be Released Q3 2017, Say Rumours | RX 480 Can be Flashed to RX 580 `_ 10 min. 4/19/2017
#. `NVIDIA Opening Keynote Highlights at CES 2017 `_ 37 min
Cloud computing
----------------
The material is from Wikipedia, which appeared better than any other
source that I could find.
#. Hierarchy:

   a. `IaaS `_ (Infrastructure as a Service)

      i. Sample functionality: VM, storage
      #. Examples:

         a. `Google_Compute_Engine `_
         #. `Amazon_Web_Services `_
         #. `OpenStack `_ : compute, storage, networking, dashboard
   #. `PaaS `_ (Platform ...)

      i. Sample functionality: OS, Web server, database server
      #. Examples:

         a. `OpenShift `_
         #. `Cloud_Foundry `_
         #. `Hadoop `_ :

            i. distributed FS, Map Reduce
            #. derived from Google FS, map reduce
            #. used by Facebook etc.
            #. Now, people often run `Apache Spark™ - Lightning-Fast Cluster Computing `_ instead of Hadoop, because Spark is faster.
   #. `SaaS `_ (Software ...)

      i. Sample functionality: email, gaming, CRM, ERP
#. `Cloud_computing_comparison `_
#. Virtual machine.

   The big question is, at what level does the virtualization occur?
   Do you duplicate the whole file system and OS, even emulate the HW,
   or just try to isolate files and processes in the same OS?

   a. `Virtualization `_
   #. `Hypervisor `_
   #. `Xen `_
   #. `Kernel-based_Virtual_Machine `_
   #. `QEMU `_
   #. `VMware `_
   #. Containers, e.g., Docker.
   #. `Comparison_of_platform_virtual_machines `_
#. Distributed storage:

   a. `Virtual_file_system `_
   #. `Lustre_(file_system) `_
   #. `Comparison_of_distributed_file_systems `_
   #. `Hadoop_distributed_file_system `_
#. See also:

   a. `VNC `_
   #. `Grid_computing `_

      i. decentralized, heterogeneous
      #. used for major projects like protein folding
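The map-reduce model that Hadoop popularized (mentioned above) can be sketched as a toy word count in Python; this shows the programming model only, not the Hadoop API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Mapper: emit (word, 1) pairs for one document."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: combine all the values emitted for one key."""
    return key, sum(values)

def word_count(documents):
    """Run the three phases over a collection of documents."""
    pairs = chain.from_iterable(map_phase(d) for d in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real deployment the mappers and reducers run on different machines over a distributed file system; the shuffle is the expensive network step.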