PAR Lecture 5, Thu Feb 2

W Randolph Franklin, RPI

2017-02-02 00:00


Table of contents

1 Compiling and running
2 Homework 2
3 More OpenMP
4 ssh, afs, zfs
5 Types of memory allocation
6 NVidia device architecture, start of CUDA

1 Compiling and running

c++ compilers:
1. geoxeon has 5.4 by default, and 6.2 if you use g++-6
2. parallel has 4.8.5 by default, and 5.3.1 in /opt/devtoolset-4/root/bin/
  
  The problem is that parallel runs CentOS which makes it very hard to install recent versions of SW. Installing gcc-6 might require compiling the source.
Measuring program times when many people are on the machine:
1. This is a problem.
2. One partial solution is to use batch to queue your job to run when the system is unloaded.
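A complementary trick, sketched below, is to time the same work several times and keep the minimum, since interference from other users can only make a run slower. This is an illustration and not one of the course files; the helper name min_seconds is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: run `work` `trials` times and return the fastest
// wall-clock time in seconds. The minimum is the run least disturbed by
// other jobs on a shared machine.
template <typename F>
double min_seconds(F work, int trials) {
    assert(trials > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int t = 0; t < trials; ++t) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        best = std::min(best,
                        std::chrono::duration<double>(stop - start).count());
    }
    return best;
}
```

Call it with a lambda that wraps the code being measured, e.g. min_seconds([&] { run_my_kernel(); }, 5), where run_my_kernel stands for whatever you are timing.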

2 Homework 2

Homework 2 is online. I'll leave time at the end of this class and the next for you to start and ask questions.

3 More OpenMP

Another nice, albeit obsolescent and incomplete, ref: Wikibooks OpenMP.
We'll look at its tasks examples.
OpenMP tasks
1. While inside a pragma parallel, you queue up lots of tasks - they're executed as threads are available.
2. My example is tasks.cc with some variants.
3. taskfib.cc is modified from an example in the OpenMP spec.
  1. It shows how tasks can recursively spawn more tasks.
  2. This is only an example; you would never implement Fibonacci that way. (The fastest way is the following closed form. In spite of the sqrts, the expression always gives an integer.)
    
    \(F_n = \frac{\left( \frac{1+\sqrt{5}}{2} \right) ^n - \left( \frac{1-\sqrt{5}}{2} \right) ^n }{\sqrt{5}}\)
  3. Spawning a task is expensive; we can calculate the cost.
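The recursive task pattern and the closed form can be sketched together as follows. This is a minimal illustration in the spirit of the spec example, not the contents of taskfib.cc; the function names are made up here. Compile with -fopenmp; without it the pragmas are ignored and the code runs serially.

```cpp
#include <cassert>
#include <cmath>

// Each call spawns two child tasks and waits for both.
// Deliberately wasteful: it only demonstrates recursive task spawning.
long fib_task(int n) {
    if (n < 2) return n;
    long a = 0, b = 0;
    #pragma omp task shared(a)
    a = fib_task(n - 1);
    #pragma omp task shared(b)
    b = fib_task(n - 2);
    #pragma omp taskwait   // a and b are valid only after this point
    return a + b;
}

// One thread creates the root task; the whole team executes the tasks.
long fib_parallel(int n) {
    assert(n >= 0);
    long result = 0;
    #pragma omp parallel
    #pragma omp single
    result = fib_task(n);
    return result;
}

// Closed form from the text, evaluated in double precision and rounded.
// Assumption: n is moderate; rounding error eventually breaks it for large n.
long fib_closed(int n) {
    assert(n >= 0);
    const double s5 = std::sqrt(5.0);
    const double phi = (1.0 + s5) / 2.0;
    const double psi = (1.0 - s5) / 2.0;
    return std::lround((std::pow(phi, n) - std::pow(psi, n)) / s5);
}
```

Timing fib_parallel against a plain recursive version for the same n gives a rough estimate of the per-task overhead.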
stacksize.cc - get and set stack size. Can also do ulimit -s.
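A minimal sketch of reading and raising the stack limit from inside a program, assuming a POSIX system with getrlimit and setrlimit. It is not the contents of stacksize.cc, and the function names are hypothetical.

```cpp
#include <sys/resource.h>  // POSIX getrlimit / setrlimit

// Current soft stack limit in bytes (RLIM_INFINITY means unlimited).
// Returns 0 if the query fails.
rlim_t stack_soft_limit() {
    struct rlimit rl{};
    if (getrlimit(RLIMIT_STACK, &rl) != 0) return 0;
    return rl.rlim_cur;
}

// Try to raise the soft limit to at least `bytes`.
// Returns false if the hard limit forbids it or a system call fails.
bool raise_stack_limit(rlim_t bytes) {
    struct rlimit rl{};
    if (getrlimit(RLIMIT_STACK, &rl) != 0) return false;
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur >= bytes) return true;  // already enough
    if (rl.rlim_max != RLIM_INFINITY && bytes > rl.rlim_max) return false;
    rl.rlim_cur = bytes;
    return setrlimit(RLIMIT_STACK, &rl) == 0;
}
```

Call raise_stack_limit early in main, before any deep recursion or large local arrays are used.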
omps.cc - read assorted OMP variables.
Since I/O is not thread-safe, in hello_single_barrier.cc it has to be protected by a critical and also followed by a barrier.
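The critical-plus-barrier pattern can be sketched like this. It is an illustration of the idea, not the contents of hello_single_barrier.cc; it writes to a string stream instead of the terminal so the result can be inspected, and hello_log is a hypothetical name.

```cpp
#include <sstream>
#include <string>
#ifdef _OPENMP
#include <omp.h>
#endif

// Every thread writes one line; the critical keeps the lines from
// interleaving, and the barrier makes sure all of them are written
// before the final line is added by a single thread.
std::string hello_log() {
    std::ostringstream out;
    #pragma omp parallel
    {
        int id = 0;
#ifdef _OPENMP
        id = omp_get_thread_num();
#endif
        #pragma omp critical
        out << "hello from thread " << id << '\n';

        #pragma omp barrier

        #pragma omp single
        out << "all threads have reported\n";
    }
    return out.str();
}
```

The order of the hello lines varies from run to run; only the position of the last line is fixed.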
OpenMP sections - my example is sections.cc with some variants.
1. Note my cool way to print an expression's name followed by its value.
2. Note the 3 required levels of pragmas: parallel, sections, section.
3. The assignment of sections to threads is nondeterministic.
4. IMNSHO, OpenMP is considerably easier than pthreads, fork, etc.
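A minimal sketch of the three pragma levels, plus one common way to print an expression's name with its value using the preprocessor's stringizing operator. Whether sections.cc uses the same macro is an assumption; SHOW and run_sections are names made up for this illustration.

```cpp
#include <sstream>
#include <string>

// #x turns the macro argument into a string literal, so SHOW(a + b)
// streams the text "a + b", then " = ", then the value of the expression.
#define SHOW(x) #x << " = " << (x)

std::string run_sections() {
    int a = 0, b = 0, c = 0;
    #pragma omp parallel              // level 1: create the thread team
    {
        #pragma omp sections          // level 2: a group of independent blocks
        {
            #pragma omp section       // level 3: one block, run by one thread
            a = 1 + 1;
            #pragma omp section
            b = 2 * 3;
            #pragma omp section
            c = 10 - 3;
        }
    }
    std::ostringstream out;
    out << SHOW(a) << ", " << SHOW(b) << ", " << SHOW(c) << ", "
        << SHOW(a + b + c);
    return out.str();
}
```

Which thread runs which section is not fixed, but each section writes a different variable, so the result does not depend on the assignment.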
http://stackoverflow.com/questions/13788638/difference-between-section-and-task-openmp
What you should have learned from the OpenMP lectures:
1. How to use a well-designed and widely used API that is useful on almost all current CPUs.
2. Shared memory: the most common parallel architecture.
3. How to structure a program to be parallelizable.
4. Issues in parallel programs, like nondeterminism, race conditions, critical regions, etc.

4 ssh, afs, zfs

I recommend that you create key pairs to connect to geoxeon and parallel w/o typing your password each time.
To avoid retyping your ssh private key passphrase in the same shell, do this:

ssh-add
One advantage of ssh is that you can mount your geoxeon directory on your local machine. On my laptop, I run nemo, then do File - Connect to server, choose type ssh, server geoxeon.ecse.rpi.edu, use my RCSID as the user name, leave the password blank, and give my private key passphrase if asked. Any other program to mount network shares would work as well.

Where the share is actually mounted varies. On my laptop, it's here: /var/run/user/1000/gvfs/sftp:host=geoxeon.ecse.rpi.edu,user=wrf

As with any network mount, fancy filesystem things like attributes, simultaneous reading and writing from different machines, etc., probably won't work.
With ssh, you can also copy files back and forth w/o mounting or typing passwords:
1. scp localfile geoxeon.ecse.rpi.edu:
2. scp -r localdir geoxeon.ecse.rpi.edu:
3. scp geoxeon.ecse.rpi.edu:remotefile .
It even does filename completion on remote files.
You can also run single commands:

ssh geoxeon.ecse.rpi.edu hostname
Geoxeon implements AFS, so your RCS files are accessible (read and write) at

/afs/rpi.edu/home/??/RCSID

Search for your home dir thus:

ls -ld /afs/rpi.edu/home/??/RCSID*

Authenticate yourself with klog.
On geoxeon, your files are transparently compressed, so there's no need to explicitly use gzip etc.

However, files on parallel are not transparently compressed.

5 Types of memory allocation

Here's a brief overview of my understanding of the various places that you can assign memory in a program.

Static. Define a fixed-size global array. The variable is constructed at compile time, so accesses might be faster. Global vars with non-default initial values increase the executable file size. If they're large enough, you need to use the compiler option -mcmodel=medium or -mcmodel=large. I don't know the effect on the program's size or speed.
Stack. Define local arrays that are created and freed as the routine is entered and exited. Their addresses relative to the base of this call frame may be constant. The default stack size is 8MB. You can increase this with the command ulimit or in the program as shown in stacksize.cc. I believe that in OpenMP, the max stack size may be allocated when each thread is created. Then, a really big stack size might have a penalty.
Heap. You use new and delete. Variables are constructed whenever you want. The more objects on the heap, the more time that each new or delete takes. If you have lots of objects, consider using placement new or creating an array of them.

I like to use static, then stack, and heap only when necessary.
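The three kinds of allocation can be put side by side in a minimal sketch. The array size and function names are arbitrary choices for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>

constexpr std::size_t kN = 1000;

// Static: size fixed at compile time, lives for the whole program.
// No explicit initializer, so it adds nothing to the executable file size.
static int static_array[kN];

long sum_static() {
    std::iota(static_array, static_array + kN, 0);  // fill with 0, 1, 2, ...
    return std::accumulate(static_array, static_array + kN, 0L);
}

// Stack: created on entry to the routine and freed on exit.
// kN ints is far below the default 8MB stack limit.
long sum_stack() {
    int local[kN];
    std::iota(local, local + kN, 0);
    return std::accumulate(local, local + kN, 0L);
}

// Heap: one new[] for the whole array instead of one new per element,
// matched by exactly one delete[].
long sum_heap(std::size_t n) {
    assert(n > 0);
    int* p = new int[n];
    std::iota(p, p + n, 0);
    const long total = std::accumulate(p, p + n, 0L);
    delete[] p;
    return total;
}
```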

Google's allocator is noticeably better than the default one. To use it, link your programs with -ltcmalloc. You can often use it on an existing executable foo thus:

LD_PRELOAD="/usr/lib/libtcmalloc.so" foo

I found it to save 15% to 30% in time.

Another memory concern is speed. Geoxeon has a NUMA (Non-Uniform Memory Access) architecture. It has two 8-core Xeons. Each Xeon has 64GB of main memory attached. Although all 128GB are in a common address space, accessing memory attached to the same Xeon that the thread is running on is faster.

The following is what I think based on some research, but may be wrong: A 4KB page of memory is assigned to a specific processor's memory when it is first written (not when it is reserved). So, each page of a large array may be on a different processor. This can be used to optimize things. This gets more fun with 8-processor systems.
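If that first-write behavior holds, one way to exploit it is to initialize an array with the same parallel loop schedule that later reads it, so each thread first writes the pages it will use. The sketch below shows the pattern under that assumption; it does not verify page placement, and first_touch_sum is a hypothetical name. Compile with -fopenmp.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

double first_touch_sum(std::size_t n) {
    assert(n > 0);
    // new double[n] leaves the elements uninitialized, so nothing is
    // written yet. (std::make_unique<double[]>(n) would zero the array
    // from the calling thread and defeat the purpose.)
    std::unique_ptr<double[]> a(new double[n]);
    double* p = a.get();
    const long count = static_cast<long>(n);

    // First write: with a static schedule each thread gets one contiguous chunk.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < count; ++i) p[i] = 1.0;

    // Later use: the same schedule, so each thread reads the chunk it wrote.
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+ : sum)
    for (long i = 0; i < count; ++i) sum += p[i];
    return sum;
}
```

For this to help, the threads also have to stay on the same processor between the two loops, which is where assigning threads to cores comes in.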

All that is separate from cache issues.

You can also assign your OpenMP threads to specific cores. This affects speed in ways I don't understand. The issues are resource sharing vs conflicts.

6 NVidia device architecture, start of CUDA

The above shared memory model hits a wall; CUDA handles the other side of the wall.