Week 1
Thurs 2014-01-23 (L1)
General
- The syllabus is on the course home page.
- My local cache of material from the web is here. (Aside: It's not true that material on the web is permanent. A number of sites that I've used in classes have vanished. For some of them, the only online version is my cache, which now pops up in google searches.)
- Although I generally recommend clang++ over g++, for this course, you need to use g++. OpenMP is not yet in the clang trunk, and nvcc assumes g++.
- Homework 1 is online, due Monday (it's short).
OpenMP
The first 3 lectures will be on OpenMP (and an intro to parallel computation). There are several good online references.
- http://en.wikipedia.org/wiki/OpenMP good summary.
- A series of video tutorials with slides by Mattson @ Intel:
- http://openmp.org/wp/2013/12/tutorial-introduction-to-openmp/
- https://www.youtube.com/playlist?list=PLLX-Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG
- Official site, with specs etc: http://openmp.org/
- https://computing.llnl.gov/tutorials/openMP/ - detailed tutorial with many examples. LLNL also has a good parallel tutorial: https://computing.llnl.gov/tutorials/parallel_comp/
- https://sharepoint.campus.rwth-aachen.de/units/rz/HPC/public/Shared%20Documents/Forms/ParProg%202013.aspx also looks quite good, and has exercises.
Week 2
Mon 2014-01-27 (L2)
Top 500 machines
Mentioned last Thursday: http://top500.org/
SSH to my lab computer
Before using my lab computer, you need to send me:
- your name (unless it's in your email).
- your RCS username (unless you emailed using it).
- your SSH public key (and keep the matching private key on your own computer). I'll let you research what this means and how to create it.
Then you access it thus: ssh geoxeon.ecse.rpi.edu .
LMS
Why do I use it? It's widely used at RPI. It's good for handing in homeworks and handing back grades. However, it's pretty bad at almost everything else.
OpenMP
- The class is encouraged to use their laptops to run these programs with me.
- Look at the summary card: OpenMP3.1-CCard.pdf IMO you won't use most of it.
- cat /proc/cpuinfo shows the CPU cores.
- Work with the program Gfiles:openmp/mattson/OMP_exercises/pi.c
- Run it as is.
- Add a parallel pragma.
- Add a single to announce the loop.
- Run with 1 thread, then 2, 4.
- Look at the time and result.
- Put a critical around sum and rerun.
- Try a reduction pragma.
- Why is the result wrong?
- Fix it.
- Try on geoxeon with up to 32 threads.
- Why is it not 32x faster?
- Use a critical then a capture pragma to get a unique number for each iteration, and compare times. (A sketch of the reduction version appears after this list.)
- Gfiles:openmp/mattson/OMP_exercises/matmul.c
- How to parallelize?
- Test.
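For reference, here is a minimal sketch of mine of where the pi.c exercise ends up: a parallel for with a reduction on sum, so each thread keeps a private partial sum. It is my illustration, not Mattson's file; the step count is an assumed value.

// Minimal sketch, not Mattson's pi.c. Compile: g++ -fopenmp pi_sketch.cc
#include <cstdio>

int main() {
    const long num_steps = 100000000;        // assumed value; pi.c may differ
    const double step = 1.0 / num_steps;
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)   // each thread gets a private sum, combined at the end
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);          // integrate 4/(1+x^2) over [0,1]
    }
    printf("pi ~= %.15f\n", step * sum);
    return 0;
}

Compare its time against the version with a critical around sum; the critical serializes the updates.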
Thurs 2014-01-30 (L3)
- Exercises in https://computing.llnl.gov/tutorials/openMP/exercise.html . Also the description of OpenMP is excellent.
- Some programs in http://people.sc.fsu.edu/~jburkardt/c_src/openmp/openmp.html
- http://people.sc.fsu.edu/~jburkardt/c_src/md_openmp/md_openmp.html - for molecular dynamics
- http://people.sc.fsu.edu/~jburkardt/c_src/fft_openmp/fft_openmp.c - for the FFT
- http://people.sc.fsu.edu/~jburkardt/c_src/multitask_openmp/multitask_openmp.html - shows sections
- http://people.sc.fsu.edu/~jburkardt/cpp_src/dijkstra_openmp/dijkstra_openmp.cpp - shows single, critical, barrier
- http://people.sc.fsu.edu/~jburkardt/c_src/heated_plate_openmp/heated_plate_openmp.c
- http://sc.tamu.edu/shortcourses/SC-openmp/OpenMPSlides_tamu_sc.pdf - another nice tutorial
- http://stackoverflow.com/questions/13065943/task-based-programming-pragma-omp-task-versus-pragma-omp-parallel-for/13068512#13068512 - tasks
- Homework 2, due next Thurs.
Week 3
Mon 2014-02-03 (L4)
- OpenMP sections - my example is sections.cc with some variants.
- Note my cool way to print an expression's name followed by its value.
- Since I/O is not thread-safe, it has to be protected by a critical and also followed by a barrier.
- Note the 3 required levels of pragmas: parallel, sections, section.
- The assignment of sections to threads is nondeterministic.
- IMNSHO, OpenMP is considerably easier than pthreads, fork, etc.
- OpenMP tasks
- While inside a parallel pragma, you queue up lots of tasks; they're executed as threads are available.
- Here's one reasonable ref: https://iwomp.zih.tu-dresden.de/downloads/omp30-tasks.pdf
- My example is tasks.cc with some variants.
- taskfib.cc is modified from an example in the OpenMP spec. (A minimal sketch of the pattern appears after this list.)
- It shows how tasks can recursively spawn more tasks.
- This is only an example; you would never implement fibonacci that way. (The fastest way is the following closed form. In spite of the sqrts, the expression always gives an integer.) {$$ F_n = \frac{\left( \frac{1+\sqrt{5}}{2} \right) ^n - \left( \frac{1-\sqrt{5}}{2} \right) ^n }{\sqrt{5}} $$}
- Spawning a task is expensive; we can calculate the cost.
- The cost of false sharing, using variants of falsesharing.cc, which is modified from the Wikipedia article.
- The cost is really complicated; I don't understand it well.
- OpenMP 4.0
- http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf
- OpenMP 4 - What's New?
- Since it was just published last summer, it hasn't been fully implemented yet.
- What you should have learned from the OpenMP lectures:
- How to use a well-designed and widely used API that is useful on almost all current CPUs.
- Shared memory: the most common parallel architecture.
- How to structure a program to be parallelizable.
- Issues in parallel programs, like nondeterminism, race conditions, critical regions, etc.
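Referring back to the tasks bullet above: a minimal sketch of mine in the spirit of taskfib.cc (not the actual course file); the serial cutoff at 25 is my choice, reflecting that spawning a task is expensive.

// Sketch of recursive task spawning. Compile: g++ -fopenmp taskfib_sketch.cc
#include <cstdio>
#include <omp.h>

long fib(int n) {
    if (n < 2) return n;
    if (n < 25) return fib(n - 1) + fib(n - 2);   // serial cutoff: task spawning is expensive
    long a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait                          // wait for the two child tasks
    return a + b;
}

int main() {
    long result;
    #pragma omp parallel
    #pragma omp single                            // one thread builds the initial task tree
    result = fib(35);
    printf("fib(35) = %ld\n", result);
    return 0;
}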
NVidia device architecture, start of CUDA
- The above shared memory model hits a wall; CUDA handles the other side of the wall.
- I'll lecture from customized versions of the Stanford notes.
Thurs 2014-02-06 (L5)
Comments About Memory
- In a C++ program, on the host, there are (at least) 3 classes of memory
for a variable.
- Global. The variable is constructed at compile time, so accesses might perhaps be faster. Global vars with non default initial values increase the executable file size. There may be an undocumented limit to the total size of all global variables, with an obscure error message.
- Local, on the stack. The variable is constructed when the function is called; its address relative to the base of this call frame may be constant. There may be an undocumented limit to the max stack size, with an obscure error message.
- On the heap. Variables are constructed whenever you want. However, constructing and destroying many variables, especially of varying sizes, fragments the heap. That makes the time superlinear -- each successive construction takes more time.
- My rule is to push design decisions as early in the process as possible.
Constructing variables at compile time is best, at function call time is
2nd, and on the heap is worst.
- If I have to construct variables on the heap, I construct few and large variables, never many small ones.
- Often I compile the max dataset size into the program, which permits constructing the arrays at compile time. Recompiling for a larger dataset is quick (unless you're using CUDA). Accessing this type of variable uses one less level of pointer than accessing a variable on the heap. I don't know whether this is faster with a good optimizing compiler, but it's probably not slower.
- If the data will require a dataset with unpredictably sized components, such as a ragged array, then I may do the following (a sketch appears after this list).
- Read the data once to accumulate the necessary statistics.
- Construct the required ragged array.
- Reread the data and populate the array.
- With CUDA, the dominant problem in program optimization is optimizing the data flow. Getting the data quickly to the cores is harder than processing it. It helps a lot to have regular arrays, where each core reads or writes a successive entry. This is analogous to the hardware fact that wires are bigger (hence, more expensive) than gates.
- That is the opposite optimization to OpenMP, where having different threads writing to adjacent addresses will cause the false sharing problem.
- lecture_3/cuda_threads_and_atomics.pdf. The containing directory is this.
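A minimal sketch of mine of the two-pass ragged-array idea above; the record format and names are made up for illustration.

// Two passes: count per row, one big allocation, then fill. Compile: g++ -std=c++11 ragged_sketch.cc
#include <vector>
#include <utility>
#include <cstdio>

int main() {
    // Pretend "the data" is a list of (row, value) records read from a file.
    std::vector<std::pair<int,double>> records = {{0,1.0},{2,3.0},{0,2.0},{1,5.0}};
    const int nrows = 3;

    // Pass 1: accumulate the necessary statistics (count per row).
    std::vector<int> count(nrows, 0);
    for (auto &r : records) count[r.first]++;

    // Construct the ragged array: offsets into one contiguous value array.
    std::vector<int> offset(nrows + 1, 0);
    for (int i = 0; i < nrows; i++) offset[i+1] = offset[i] + count[i];
    std::vector<double> values(offset[nrows]);

    // Pass 2: reread the data and populate the array.
    std::vector<int> next(offset.begin(), offset.end() - 1);
    for (auto &r : records) values[next[r.first]++] = r.second;

    // Row i occupies values[offset[i] .. offset[i+1]-1].
    for (int i = 0; i < nrows; i++)
        printf("row %d has %d entries\n", i, offset[i+1] - offset[i]);
    return 0;
}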
Week 4
Mon 2014-02-10 (L6)
Lecture material
- lecture_4/cuda_memories.pdf
- Tutorial programs from tutorials/.
- sum_reduction.cu
Thurs 2014-02-13
Class canceled because of snow.
Week 5
Tues 2014-02-18 (L7)
GPU summary
Here's a summary of the GPU architecture as I understand it. It's more compact than I've found elsewhere. I'll add to it from time to time. Quoted sizes are for Kepler GK110 with Compute Capability 3.5. Many of the features are new with 3.5.
- The host is the CPU.
- The device is the GPU.
- The device contains 14 SMX streaming multiprocessors. (This is the Kepler name; on Fermi they were called streaming multiprocessors (SMs).)
- A thread is a sequential program with private and shared memory, program counter, etc.
- Threads are grouped, 32 at a time, into warps.
- Warps of threads are grouped into blocks. Often the warps are only implicit, and we consider that the threads are grouped directly into blocks.
- Blocks of threads are grouped into a grid, which is all the threads in the kernel.
- A kernel is a parallel program executing on the device.
- The kernel runs potentially thousands of threads.
- A kernel can create other kernels and wait for their completion.
- There may be a limit, e.g., 5 seconds, on a kernel's run time.
- Thread-level resources:
- Each thread can use up to 255 fast registers. Registers are private to the thread. All the threads' registers are allocated from a fixed pool. The more registers that each thread uses, the fewer threads that can run simultaneously.
- Each thread has 512KB slow local memory, allocated from the global memory.
- Local memory is used when not enough registers are available, and to store thread-local arrays.
- Warp-level resources:
- Threads are grouped, 32 at a time, into warps.
- Each warp executes as a SIMD, with one instruction register. At each cycle, every thread in a warp is either executing the same instruction, or is disabled. If the 32 threads want to execute 32 different instructions, then they will execute one after the other, sequentially. If you read in some NVidia doc that threads in a warp run independently, then continue reading the next page to get the info mentioned in the previous paragraph.
- If successive instructions in a warp do not depend on each other, then, if there are enough warp schedulers available, they may be executed in parallel. This is called Instruction Level Parallelism (ILP).
- For an array in local memory, which means that each thread will have its private copy, the elements for all the threads in a warp are interleaved to potentially increase the I/O rate. Therefore your program should try to have successive threads read successive words of arrays.
- A thread can read variables from other threads in the same warp, with the shuffle instruction. Typical operations are to read from the K-th next thread, to do a butterfly permutation, or to do an indexed read. This happens in parallel for the whole warp, and does not use shared memory. (A small shuffle sketch appears after this summary.)
- A warp vote combines a bit computed by each thread to report results like all or any.
- Block-level resources:
- A block may contain up to 1024 threads.
- Each block has access to 65536 fast 32-bit registers, for the use of its threads.
- Each block can use up to 49152 bytes of the SMX's fast shared memory. The block's shared memory is shared by all the threads in the block, but is hidden from other blocks. Shared memory is basically a user-controllable cache of some global data. The saving comes from reusing that shared data several times after you loaded it from global memory once.
- Warps in a block run asynchronously and run different instructions. They are scheduled and executed as resources are available.
- The threads in a block can be synchronized with __syncthreads(). Because of how warps are scheduled, that can be slow.
- The threads in a block can be arranged into a 3D array, up to 1024x1024x64. That is for convenience, and does not increase performance.
- I'll talk about textures later.
- Streaming Multiprocessor (SMX) - level resources:
- Each SMX has 192 single-precision CUDA cores, 64 double-precision units, 32 special function units, and 32 load/store units.
- A CUDA core is akin to an ALU. The cores, and all the units, are pipelined.
- A CUDA core is much less powerful than one core of an Intel Xeon. My guess is 1/20th.
- Beware that, in the CUDA C Programming Guide, NVidia sometimes calls an SMX a core.
- The limited number of, e.g., double precision units means that a DP instruction will need to be scheduled several times for all the threads to execute it. That's why DP is slower.
- Each SMX has 32K fast 4-byte registers, to be divided among its threads. (Yes this sounds inconsistent with the number of registers per block. When I figure it out, I'll edit this. I'm synthesizing this doc from several sources, some of which are older.)
- Each SMX has 4 warp schedulers and 8 instruction dispatch units.
- 64 warps can simultaneously reside in an SMX.
- Therefore up to 32x64=2048 threads can be executed in parallel by an SMX.
- Up to 16 blocks can simultaneously be resident in an SMX. However, if each block uses too many resources, like shared memory, then this number is reduced. Each block sits on only one SMX; no block is split. However a block's warps are executed asynchronously (until synced).
- Each SMX has 64KiB fast memory to be divided between shared memory and an L1 cache. Typically, 48KiB is used for the shared memory, to be divided among its resident blocks, but that can be changed.
- The L1 cache can cache local or global memory.
- Each SMX has a read-only data cache of 48KB to cache the global constant memory.
- Each SMX has 8 profile counters that can be incremented by any thread.
- Grid-level resources:
- The blocks in a grid can be arranged into a 3D array, up to {$ (2^{31}-1, 2^{16}-1, 2^{16}-1) $}.
- Blocks in a grid are queued and executed as resources are available, in an unpredictable parallel or serial order. Therefore they should be independent of each other. Trying to sync blocks can cause a deadlock.
- The number of instructions in a kernel is limited to 2M.
- Any thread can stop the kernel by calling assert.
- Device-level resources:
- There is a large and slow 6GB global memory, which persists from kernel to kernel. Transactions to global memory are 128 bytes. Host memory can also be memory-mapped into global memory, although the I/O rate will be lower. Reading from global memory can take hundreds of cycles. A warp that does this will be paused and another warp started. Such context switching is very efficient. Therefore device throughput stays high, although there is a latency. This is called Thread Level Parallelism (TLP) and is a major reason for GPU performance. That assumes that an SMX has enough active warps that there is always another warp available for execution. That is a reason for having warps that do not use all the resources (registers etc) that they're allowed to.
- There is a 1536KB L2 cache, for sharing data between SMXs.
- There is a small, fast 64KiB global constant memory, which also persists from kernel to kernel. It is implemented as a piece of the global memory, made fast with caches. (Again, I'm still resolving this apparent contradiction.)
- Grid Management Unit (GMU) schedules (pauses, executes, etc) grids on the device. This is more important because grids can now start other grids (Dynamic Parallelism).
- Hyper-Q: 32 simultaneous CPU tasks can launch kernels into the queue; they don't block each other. If one kernel is waiting, another runs.
- CUDA Work Distributor (CWD) dispatches 32 active grids at a time to the SMXs. There may be 1000s of grids queued and waiting.
- GPU Direct: Other devices can DMA the GPU memory.
- Refs:
- http://www.geeks3d.com/20100606/gpu-computing-nvidia-cuda-compute-capability-comparative-table/
- The CUDA program deviceDrv.
- http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
- Better Performance at Lower Occupancy, Vasily Volkov, UC Berkeley, 2010.
- https://www.pgroup.com/lit/articles/insider/v2n1a5.htm - well written but old.
- http://www.theregister.co.uk/2012/05/15/nvidia_kepler_tesla_gpu_revealed/
(I'll keep adding to this. Suggestions are welcome.)
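To make the warp shuffle concrete, here is a sketch of mine (not one of the course's tutorial files) of a warp-level sum with __shfl_down, which is available on sm_30 and later; later CUDA versions rename it __shfl_down_sync.

// Compile: nvcc -arch=sm_35 warpsum_sketch.cu
#include <cstdio>

__global__ void warpSum(const float *in, float *out) {
    int lane = threadIdx.x & 31;              // position within the warp
    float v = in[blockIdx.x * 32 + lane];
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down(v, offset);          // read from the thread "offset" lanes away
    if (lane == 0) out[blockIdx.x] = v;       // lane 0 now holds the warp's sum
}

int main() {
    const int nwarps = 4, n = nwarps * 32;
    float in[n], out[nwarps], *d_in, *d_out;
    for (int i = 0; i < n; i++) in[i] = 1.0f;
    cudaMalloc((void **)&d_in, sizeof in);
    cudaMalloc((void **)&d_out, sizeof out);
    cudaMemcpy(d_in, in, sizeof in, cudaMemcpyHostToDevice);
    warpSum<<<nwarps, 32>>>(d_in, d_out);
    cudaMemcpy(out, d_out, sizeof out, cudaMemcpyDeviceToHost);
    printf("%f\n", out[0]);                   // 32
    return 0;
}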
- CUDA function qualifiers:
- __global__ - a device function called from the host, starting a kernel.
- __device__ - a device function called from a device function.
- __host__ - (the default) a host function called from a host function.
- CUDA variable qualifiers:
- __shared__
- __device__ - global memory
- __constant__
- (nothing) - a register if scalar, or local if an array or if no more registers are available.
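A tiny sketch of mine showing these qualifiers used together; the names and values are made up.

// Compile: nvcc -arch=sm_35 qualifiers_sketch.cu
#include <cstdio>

__constant__ float scale;                    // __constant__: global constant memory
__device__ float bias = 1.0f;                // __device__ variable: global device memory

__device__ float f(float x) {                // __device__ function: called from device code
    return scale * x + bias;
}

__global__ void kernel(float *out, int n) {  // __global__: called from the host
    __shared__ float buf[256];               // __shared__: shared within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float t = i;                             // unqualified scalar: a register
    buf[threadIdx.x] = t;
    if (i < n) out[i] = f(buf[threadIdx.x]);
}

int main() {                                 // __host__ is the default here
    const int n = 256;
    float h_scale = 2.0f, h_out[n], *d_out;
    cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    kernel<<<1, 256>>>(d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", h_out[10]);               // 2*10+1 = 21
    return 0;
}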
Installing NVidia drivers; System in low graphics mode
Here are some relevant pages.
- http://askubuntu.com/questions/141606/how-to-fix-the-system-is-running-in-low-graphics-mode-error
- http://askubuntu.com/questions/61396/installing-nvidia-drivers/61433#61433
- http://techiesf1.blogspot.co.uk/2013/03/ubuntu-is-running-in-low-graphics-mode.html
- http://askubuntu.com/questions/323108/error-system-is-running-in-low-graphics-mode-solved-only-partially
CUDA FAQs
Here are some that I've found useful.
- https://developer.nvidia.com/cuda-faq
- http://stackoverflow.com/questions/tagged/cuda?sort=faq
- http://www.geforce.com/hardware/technology/cuda/faq
Useful commands etc
nvidia-smi
nvcc -o foo -lineinfo -G -O3 -arch=sm_35 --ptxas-options=-v foo.cu -lglut
Don't always use all of the above options at once.
ACM webcast: "Achieve Massively Parallel Acceleration with GPUs"
Thurs 2/27, 1pm, http://www.acm.org/news/featured/webinars
Lecture material
- Review tutorials/sum_reduction.cu
- lecture_4/cuda_memories.pdf from slide 39
- lecture_5/performance_considerations.pdf/
- Tutorial programs from tutorials/.
- matrix_multiplication.cu
- shared_variables.cu
Thurs 2014-02-20 (L8)
- lecture_6/parallel_patterns_1.pdf, up through slide 26.
- Adjourn at 4:50 for the Bloomberg Tech Talk in CII3045.
- Homework 3 is online, due in a week.
Week 6
Mon 2014-02-24 (L9)
CUDA Changes
Since the Stanford slides, CUDA has been enhanced in a few ways.
- Typed pointers. Pointers contain info labeling them as on the host or device, so that copying is easier.
- Unified Virtual Addressing (UVA). Memory can be allocated on the
host, pinned into real memory, and directly referenced by a device
function via a special pointer. Properties:
- No need to copy data from the host into device global memory. That saves your time and machine time.
- However each access to host memory is even slower than to device global memory.
- At some point, copying the host data to device memory becomes advisable.
- Unified memory. This is new in CUDA 6, just released last week, and installed on geoxeon today. This supplements UVA by automatically caching host data in device global memory.
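A sketch of mine (not add3.cu or add5.cu) contrasting the two approaches, using an assumed trivial kernel.

// Compile: nvcc -arch=sm_35 uva_vs_managed_sketch.cu
#include <cstdio>

__global__ void inc(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1.0f;
}

int main() {
    const int N = 1 << 20;

    // (a) UVA / pinned ("zero-copy") host memory: the kernel reads host RAM over PCIe.
    float *p;
    cudaHostAlloc((void **)&p, N * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < N; i++) p[i] = i;
    inc<<<(N + 255) / 256, 256>>>(p, N);
    cudaDeviceSynchronize();

    // (b) CUDA 6 unified memory: the runtime caches/migrates the data to the device.
    float *q;
    cudaMallocManaged(&q, N * sizeof(float));
    for (int i = 0; i < N; i++) q[i] = i;
    inc<<<(N + 255) / 256, 256>>>(q, N);
    cudaDeviceSynchronize();                 // required before the host touches q again

    printf("%f %f\n", p[1], q[1]);
    cudaFreeHost(p); cudaFree(q);
    return 0;
}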
NVidia docs
My local web cache of all relevant NVidia docs is here (CUDA 6 will be added soon). They are also on geoxeon in /local/cuda-{5.5,6.0}/docs/
Lecture material
- Programs from Gfiles:rpi/:
- lecture_6/parallel_patterns_1.pdf, up through slide 26.
- lecture_7/parallel_patterns_2.pdf.
Thurs 2014-02-27 (L10)
Comparison of array storage modes
The programs are add2, add3, add5. They add two 400,000,000 element integer arrays, each 1.6GB, using geoxeon.
- This ignores array allocations, mallocs, etc, which take 1500 - 2000 msecs.
- It also ignores initializing the arrays by calling rand() 400M times, which takes 7000 msec, and is, by far, the slowest part of the whole program.
- Using the CPU: Adding the arrays takes about 500 msec (using -O3).
- add2.cu: Creating host arrays, using cudaMemcpy to the device, running a kernel, using cudaMemcpy back:
- Copying a 1.6GB array either way: 500 msec.
- Running the kernel (blocksize: 256) : 30 msec.
- Total about 1600 msec.
- add3.cu: Using UVA, creating 3 pinned host arrays with cudaHostAlloc, and running a kernel that directly accesses them.
- Running the kernel (blocksize: 256) : 530 msec.
- Total about 600 msec.
- add5.cu: Using CUDA 6.0 Unified Memory,
- Each CudaMallocManaged: 500 msec.
- Kernel: 545 msec.
- Total: about 2000 msec.
Notes:
- Timing with host calls like clock() agrees with timing with CUDA events, provided things are synchronized. (A sketch of event timing appears after these notes.)
- From fastest to slowest:
- CPU.
- UVA.
- Allocating host array, cudaMemcpy to device, kernel, and cudaMemcpy back.
- Unified memory.
- However, at this stage don't worry about the times; use the easiest programming technique.
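For reference, a minimal sketch of mine of the CUDA-event timing used for numbers like those above; the kernel and sizes are made up.

// Compile: nvcc -arch=sm_35 timing_sketch.cu
#include <cstdio>

__global__ void add(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int N = 1 << 24;
    int *a, *b, *c;
    cudaMallocManaged(&a, N * sizeof(int));
    cudaMallocManaged(&b, N * sizeof(int));
    cudaMallocManaged(&c, N * sizeof(int));
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    add<<<(N + 255) / 256, 256>>>(a, b, c, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // the synchronization the note above refers to
    float msec;
    cudaEventElapsedTime(&msec, start, stop);
    printf("kernel: %f msec\n", msec);
    return 0;
}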
Lecture material
- Textures:
- You read entries from a 1-D, 2-D, or 3-D array. If your subscripts do not hit exactly on an entry, or are outside bounds, the system interpolates a value.
- You choose the interpolation rule: nearest neighbor, linear, etc.
- You choose the out-of-bounds rule: clamp, wrap, etc.
- This is useful whenever interpolation is desired, which is more than just for images. (A small binding sketch appears after the refs below.)
- http://www.math.ntu.edu.tw/~wwang/mtxcomp2010/download/cuda_04_ykhung.pdf, slides 21 and up.
- 0_Simple/simpleTexture/simpleTexture.cu from the CUDA SDK.
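A sketch of mine of this era's texture-reference API; simpleTexture.cu in the SDK is the real example, and the sizes and names here are made up. It sets the interpolation rule and the out-of-bounds rule discussed above.

// Compile: nvcc -arch=sm_35 tex_sketch.cu
#include <cstdio>
#include <vector>

texture<float, cudaTextureType2D, cudaReadModeElementType> texRef;

__global__ void sample(float *out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = tex2D(texRef, x + 0.25f, y + 0.25f);   // interpolated read between texels
}

int main() {
    const int W = 256, H = 256;
    std::vector<float> img(W * H, 1.0f);
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *arr;
    cudaMallocArray(&arr, &desc, W, H);
    cudaMemcpyToArray(arr, 0, 0, img.data(), W * H * sizeof(float), cudaMemcpyHostToDevice);

    texRef.filterMode = cudaFilterModeLinear;       // the interpolation rule
    texRef.addressMode[0] = cudaAddressModeClamp;   // the out-of-bounds rule
    texRef.addressMode[1] = cudaAddressModeClamp;
    cudaBindTextureToArray(texRef, arr, desc);

    float *d_out;
    cudaMalloc((void **)&d_out, W * H * sizeof(float));
    dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
    sample<<<grid, block>>>(d_out, W, H);
    cudaDeviceSynchronize();
    cudaUnbindTexture(texRef);
    printf("done\n");
    return 0;
}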
Mon 2014-03-03 (L9)
- Homework 4 is up, due 2014-03-20.
- More texture refs; please read them for Thurs:
- http://www.eecg.toronto.edu/~moshovos/CUDA08/slides/008%20-%20Textures.ppt
- http://www.cudahandbook.com/uploads/Chapter_10._Texturing.pdf
- http://docs.nvidia.com/cuda/cuda-c-programming-guide/#texture-and-surface-memory
- http://cuda-programming.blogspot.com/2013/02/cuda-array-in-cuda-how-to-use-cuda.html
- More texture details:
- A texture can have many layers, all of the same size. This is not a mipmap.
- A cube-map texture has a texture for each face of a cube.
- 2D and 3D cuda arrays, used for textures, are stored in a space-filling curve order, to increase the probability that two elements with similar indices are close in memory.
- Surface memory is like a texture that can be written to. However changes may not be visible until a new kernel is executed.
- CUDA and OpenGL
- You can create a buffer object in OpenGL and then access it from CUDA.
- Thrust
- C++ STL-like layer on top of CUDA.
- Functional-programming philosophy.
- Easier programming, once you get used to it.
- Code is efficient.
- Uses some unusual C++ techniques, like overloading operator().
- lecture_8/introduction_to_thrust.pdf, up through slide 23.
- Various little demo programs I'll upload to the web, and which will be on geoxeon in /opt/parallel/thrust .
Thurs 2014-03-06 (L10)
Thrust experiments
On geoxeon, in /opt/parallel/thrust/sortreduce.cu
, I did some experiments
on sorting and reducing a 500M element vector on the host and on the
device. The times, in seconds:
| Operation | Time on host | Time on device |
|---|---|---|
| Create a vector | .74 | .014 |
| Transform a vector with a simple functor | .8 | .012 |
| Sort | 11. | .66 |
| Reduce | .5 | .013 |
| Copy from device to host | | .6 |
The device is 20x to 50x faster than the host on these simple operations.
Sorting with Thrust on device
Even if most of your work is on the host, it can pay to offload some work
to the device. E.g., sorting 1,000,000,000 ints takes 6.2 seconds (floats take 8.4
secs). That excludes time to create the array, which you'd have to do
anyway, but includes time for the device to read and write the data (via
unified memory; this is too much data to fit in the device). A complete
program is on geoxeon in /opt/parallel/thrust/devicesort3.cu
and online at
thrust/devicesort3.cu . The
critical code is this:
int *p;
cudaHostAlloc((void **)&p, N*sizeof(int), cudaHostAllocMapped);
// here write your data into p
thrust::sort(thrust::device, p, p+N);
This requires Thrust 1.7 (released last summer), and uses Unified Memory. Getting the program this small took a lot of time. sort has a new argument saying to use the device. There is no need now for distinguished host and device pointers. This is only one way that CUDA and Thrust are getting easier to use.
Even a simple op like comparing two arrays is a few times faster on the device (including the access time back to the host). If the data is already on the device, it's many times faster.
One problem with using these new features is that they are sometimes buggy. Today I found an error in thrust::tabulate. A few hours after I reported it, the error was acknowledged and fixed in the development version. That's service!
A few months back, I found an error in boost that was acknowledged and fixed. SW seems to break when I use it.
Thrust intro documentation is here: http://docs.nvidia.com/cuda/thrust/. The git page, with complete documentation, intro documentation, and many examples is here: http://thrust.github.io/.
(Yes I know that citing links like that is considered by some to be bad form. I don't care. It can be useful to have the link URL displayed, e.g., when the page is printed.)
Note that rewriting the examples to use unified memory and version 1.7 would make them shorter and clearer.
Nevertheless, the examples teach you how to easily do things that you mightn't have thought easy at all.
Faster graphical access to geoxeon
X over ssh is very slow.
Here are some things I've discovered that help.
- Use a faster, albeit less secure, cipher:
ssh -c arcfour,blowfish-cbc -C geoxeon.ecse.rpi.edu
- Use xpra; here's an example:
- On geoxeon: xpra start :77; DISPLAY=:77 xeyes&
  Don't everyone use 77; pick your own number in the range 20-99.
- On your machine (the xpra client): xpra attach ssh:geoxeon.ecse.rpi.edu:77
- I suspect that security is weak. When you start an xpra session, I suspect that anyone on geoxeon can display to it. I suspect that anyone with ssh access to geoxeon can try to attach to it, and that the 1st person wins.
- Use nx, which needs a server, e.g., FreeNX.
See CUDA graphics
At the end of today's class, we'll try to see the demos in my lab.
Term project
It's time to start thinking of your term projects. They can be anything demonstrating a knowledge of topics covered in this course -- either an implementation or a research paper. Combining your own research with a term project is fine.
Mon 2014-03-17 (L11)
Thrust notes
- In functions like reduce and transform, you often see an argument like thrust::multiplies<float>(). The syntax is as follows:
- thrust::multiplies<float> is a class.
- It overloads operator().
- However, in the call to reduce, thrust::multiplies<float>() is calling the default constructor to construct a variable of class thrust::multiplies<float>, and passing it to reduce.
- reduce will treat its argument as a function name and call it with an argument, triggering operator().
- You may also create your own variable of that class, e.g., thrust::multiplies<float> foo. Then you just say foo in the argument list, not foo().
- The optimizing compiler will replace the operator() function call with the defining expression and then continue optimizing. So, there is no overhead, unlike if you passed in a pointer to a function.
- Sometimes, e.g., in saxpy.cu, you see saxpy_functor(A). (A sketch of this pattern appears after this list.)
- The class saxpy_functor has a constructor taking one argument.
- saxpy_functor(A) constructs and returns a variable of class saxpy_functor and stores A in the variable.
- The class also overloads operator().
- (Let's call the new variable foo.) foo() calls operator() for foo; its execution uses the stored A.
- Effectively, we did a closure of saxpy_functor; that is, we bound a property and returned a new, more restricted, variable or class.
- The Thrust examples teach several non-intuitive paradigms. As I figure them out, I'll describe a few. My descriptions are modified and expanded versions of the comments in the programs. This is not a list of all the useful programs, but only of some where I am adding to their comments.
- arbitrary_transformation.cu and dot_products_with_zip.cu show the very useful zip_iterator. Using it is a 2-step process:
- Combine the separate iterators into a tuple.
- Construct a zip iterator from the tuple.
- boundingbox.cu finds the bounding box around a set of 2D points.
- The main idea is to do a reduce. However, the combining operation, instead of addition, is to combine two bounding boxes to find the box around them.
- bucket_sort2d.cu overlays a grid on a set of 2D points and finds the points in each grid cell (bucket).
- The tuple is an efficient class for a short vector of fixed length.
- Note how random numbers are generated. You combine an engine that produces random output with a distribution. However you might need more complicated coding to make the numbers good when executing in parallel. See monte_carlo_disjoint_sequences.cu.
- The problem is that the number of points in each cell is unpredictable.
- The cell containing each point is computed, and that and the points are sorted to bring together the points in each cell.
- Then lower_bound and upper_bound are used to find each bucket in that sorted vector of points.
- See the lower_bound description in http://thrust.github.io/doc/group__vectorized__binary__search.html .
- expand.cu takes a vector like V = [0, 10, 20, 30, 40] and a vector of repetition counts, like C = [2, 1, 0, 3, 1]. Expand repeats each element of V the appropriate number of times, giving [0, 0, 10, 30, 30, 30, 40]. The process is as follows.
- Since the output vector will be longer than the input, the main program computes the output size, by reduce-summing C, and constructs a vector to hold the output.
- Exclusive_scan C to obtain output offsets for each input element: C2 = [0, 2, 3, 3, 6].
- Scatter_if the nonzero counts into their corresponding output positions. A counting iterator, [0, 1, 2, 3, 4], is mapped with C2, using C as the stencil, giving C3 = [0, 0, 1, 3, 0, 0, 4].
- An inclusive_scan with max fills in the holes in C3, to give C4 = [0, 0, 1, 3, 3, 3, 4].
- Gather uses C4 to gather elements of V: [0, 0, 10, 30, 30, 30, 40].
- mode.cu shows:
- Counting the number of unique keys in a vector: sort the vector, then do an inner_product where, instead of the operators being times and plus, they are not-equal-to-the-next-element and plus.
- Counting their multiplicity: construct vectors, sized at the number of unique keys, to hold the unique keys and counts; then do a reduce_by_keys on a constant_iterator using the sorted vector as the keys. For each range of identical keys, it sums the constant_iterator; that is, it counts the number of identical keys. Write a vector of unique keys and a vector of the counts.
- Finding the most used key (the mode): do max_element on the counts vector.
- repeated_range.cu repeats each element of an N-vector K times: repeated_range([0, 1, 2, 3], 2) -> [0, 0, 1, 1, 2, 2, 3, 3]. It's a lite version of expand.cu, but uses a different technique.
- Here, N=4 and K=2.
- The idea is to construct a new iterator, repeated_range, that, when read and incremented, will return the proper output elements.
- The construction stores the relevant info in structure components of the variable.
- Treating its value like a subscript in the range [0,N*K), it divides that value by K and returns that element of its input.
- strided_range.cu and tiled_range.cu.
- set_operations.cu. This shows methods of handling an operation whose output is of unpredictable size. The question is, is space or time more important?
- If the maximum possible output size is reasonable, then construct an output vector of that size, use it, and then erase it down to its actual size.
- Or, run the operation twice. The 1st time, write to a discard_iterator and remember only the size of the written data. Then construct an output vector of exactly the right size and run the operation again. I use this technique a lot with ragged arrays in sequential programs.
- sparse_vector.cu represents and sums sparse vectors.
- A sparse vector has mostly 0s.
- The representation is a vector of element indices and another vector of values.
- Adding two sparse vectors goes as follows.
- Allocate temporary index and element vectors of the max possible size (the sum of the sizes of the two inputs).
- Catenate the input vectors.
- Sort by index.
- Find the number of unique indices by applying inner_product with addition and not-equal-to-next-element to the indices, then adding one. E.g., applied to these indices: [0, 3, 3, 4, 5, 5, 5, 8], it gives 5.
- Allocate exactly enough space for the output.
- Apply reduce_by_key to the indices and elements to add elements with the same keys. The size of the output is the number of unique keys.
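The saxpy pattern discussed above, as a sketch of mine; it is close to, but not necessarily identical to, the saxpy.cu example.

// Compile: nvcc -arch=sm_35 saxpy_sketch.cu
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <iostream>

struct saxpy_functor {
    const float a;
    saxpy_functor(float a_) : a(a_) {}             // the "closure": a is stored in the object
    __host__ __device__ float operator()(float x, float y) const {
        return a * x + y;
    }
};

int main() {
    thrust::device_vector<float> x(4, 1.0f), y(4, 2.0f);
    // saxpy_functor(3.0f) constructs a temporary object; transform calls its operator().
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_functor(3.0f));
    std::cout << y[0] << std::endl;                 // 3*1+2 = 5
    return 0;
}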
Thurs 2014-03-20 (L12)
- Homework 5 is online.
- More Thrust tutorials; the different wording might be more attractive. FYI
only, I won't cover them in class.
- Boston U
- Thrust - A Productivity-Oriented Library for CUDA. This was listed on http://code.google.com/p/thrust/downloads/list .
- Thrust by example, part 2. Some old stuff, well reviewed, and some new stuff.
- Today I'll cover more Thrust examples. Online site: http://code.google.com/p/thrust/source/browse/#hg%2Fexamples%253Fstate%253Dclosed Geoxeon copy: /opt/parallel/thrust_examples .
Week 9
Mon 2014-03-24 (L13)
GPU Technology Conferences
GTC2014 is this week.
Here are some interesting talks from 2013. I'll preview them; if you're interested, watch the complete videos and read the slides.
- Smashing Galaxies: A Hierarchical Road from Top to Bottom. Jeroen Bedorf (Leiden Observatory, Leiden University), Evghenii Gaburov (SARA, Amsterdam, the Netherlands). Find out how one can leverage massive GPU parallelism to assemble fast sparse octree construction and traverse methods by combining parallel primitives such as scan and sort algorithms. These techniques have culminated in Bonsai, a hierarchical gravitational N-body code which is used to study the formation and mergers of galaxies in the present day Universe. With the advent of Kepler's dynamic parallelism, we explore the new venues that this technology opens for scalable implementations of such hierarchical algorithms. We conclude the session with cutting edge simulations complemented with spectacular visualizations that are produced in collaboration with NVIDIA's visualization experts.
- Investigating New Numerical Techniques for Reservoir Simulations on GPUs. Hatem Ltaief (Supercomputing Laboratory, KAUST), Ahmad Abdelfattah (Center of Extreme Computing, KAUST), Rio Yokota (Center of Extreme Computing, KAUST). Reservoir simulation involves sparse iterative solvers for linear systems that arise from implicit discretizations of coupled PDEs from high-fidelity reservoir simulators. One of the major bottlenecks in these solvers is the sparse matrix-vector product. Sparse matrices are usually compressed in some format (e.g., CSR, ELL) before being processed. In this talk, we focus on the low-level design of a sparse matrix-vector (SpMV) kernel on GPUs. Most of the relevant contributions focus on introducing new formats that suit the GPU architecture such as the diagonal format for diagonal matrices and the blocked-ELL format for sparse matrices with small dense blocks. However, we target both generic and domain-specific implementations. Generic implementations basically target the CSR and ELL formats, in order to be part of the KAUST-BLAS library. More chances for further optimizations appear when the matrix has specific structure. In the talk, we will present the major design challenges and outlines, and preliminary results. The primary focus will be on the CSR format, where some preliminary results will be shown. The other bottleneck of reservoir simulations is the preconditioning in the sparse matrix solver. We investigate the possibility of a Fast Multipole Method based technique on GPUs as a compute-bound preconditioner.
- Deploying General Purpose GPUs in a Manufacturing Environment: Software and CFD Development in Bicycle Manufacturing Joe Dutka (Acer Inc.) The session will share the work of the bicycle company Velocite and researchers at NCKU in Taiwan as they used GPU computing to design their next generation of bicycles. The platform was 2 Acer AT350 F2 servers with both Quadro and Tesla cards and Intel Xeon CPUs for hybrid computation. Rather than a theoretical presentation, the session will focus on real-world implementation and the difficulties overcome to develop software and bike design. Taking place at the same time, the bike will debut in the Taipei Bike Show on March 20.
- CUDA-based Monte Carlo Simulation for Radiation Therapy Dosimetry Nicholas Henderson (Stanford University) Learn about a CUDA adaptation of Geant4, a large-scale Monte Carlo particle physics toolkit. Geant4 was originally designed to support the needs of high energy physics experiments at SLAC, CERN and other places around the world. Geant4 is an extensive toolkit which facilitates every aspect of the simulation process and has been successfully used in many other domains. Current interest is radiation therapy dosimetry. For this application the geometry is simple and the model physics is limited to low energy electromagnetics. These features allow efficient tracking of many particles in parallel on the GPU.
- What Does It Take to Accelerate SPICE on the GPU? Maxim Naumov (NVIDIA) In this talk we will introduce the basic concepts behind The Simulation Program with Integrated Circuit Emphasis (SPICE) and discuss in detail the two most time consuming parts of the circuit simulation: the device model evaluation and the solution of large sparse linear systems. In particular, we focus on the evaluation of the basic models, such as resistor, capacitor and inductor as well as more complex transistor (BSIM4v7) model on the GPU. Also, we discuss the solution of sets of linear systems that are performed throughout the simulation. We take advantage of the fact that the coefficient matrices in these linear systems have the same sparsity pattern (and often end up with the same pivoting strategy) and show how to obtain their solution using a direct method on the GPU. Finally, we present numerical experiments and discuss future work. Co-authors Francesco Lannutti, Sharanyan Chetlur, Lung Sheng Chien, Philippe Vandermersch.
Week 9b
Thurs 2014-03-27 (L14)
2014 GPU Technology Conference
Nvidia made various announcements. However, various pre-announced products from 2013 have silently vanished from the charts.
2013 GPU Technology Conference
Talks on debugging
- http://on-demand.gputechconf.com/gtc/2013/presentations/S3037-S3038-Debugging-CUDA-Apps-Linux-Mac.pdf Comments on particular slides:
- slide 10: on geoxeon, see /opt/parallel/rpi/printf.cu
- slide 13: see /opt/parallel/rpi/mycuda.h
- slide 16: see /opt/parallel/rpi/assert.cu
- slide 24: see /opt/parallel/rpi/race.cu and race2.cu
- http://on-demand.gputechconf.com/gtc/2013/presentations/S3045-Getting-Most-From-GPU-Accelerated-Clusters.pdf
- http://on-demand.gputechconf.com/gtc/2013/presentations/S3051-GPU-Computing-Languages-Libraries-Tools.pdf by Jack Dongarra
Tools to try:
- cuda-memcheck --tool racecheck --racecheck-report all foo
- cuda-gdb foo
- nvprof foo
- nvvp foo
Geoxeon's local copy of a lot of CUDA documentation is in /local/cuda/ .
Week 10
Mon 2014-03-31 (L15)
- Nice Nvidia parallel summary from Oak Ridge - FYI
- GPU Computing with CUDA, Lecture 8 - CUDA Libraries - CUFFT, PyCUDA from Christopher Cooper, BU
cuFFT Notes
- cuFFT is inspired by FFTW (the Fastest Fourier Transform in the West), which its authors say is as fast as commercial FFT packages.
- I.e., sometimes commercial packages may be worth the money.
- Although the FFT is taught for N a power of two, users often want to process other dataset sizes.
- The problem is that the optimal recursion method, and the relevant coefficients, depends on the prime factors of N.
- FFTW and cuFFT determine the good solution procedure for the particular N.
- Since this computation takes time, they store the method in a plan.
- You can then apply the plan to many datasets. (A sketch of the plan-then-execute pattern appears after this list.)
- If you're going to be processing very many datasets, you can tell FFTW or cuFFT to perform sample timing experiments on your system, to help in devising the best plan.
- That's a nice strategy that some other numerical SW uses.
- One example is Automatically Tuned Linear Algebra Software (ATLAS) .
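To make the plan-then-execute pattern concrete, here is a minimal sketch of my own (not from the course files); the size 1000 is deliberately not a power of two.

// Compile: nvcc -arch=sm_35 fft_sketch.cu -lcufft
#include <cufft.h>
#include <cstdio>

int main() {
    const int N = 1000;                            // not a power of two
    cufftComplex *d;
    cudaMalloc((void **)&d, N * sizeof(cufftComplex));
    cudaMemset(d, 0, N * sizeof(cufftComplex));

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);           // the (possibly expensive) planning step
    cufftExecC2C(plan, d, d, CUFFT_FORWARD);       // the plan can be reused for many datasets
    cufftExecC2C(plan, d, d, CUFFT_INVERSE);
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d);
    printf("done\n");
    return 0;
}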
cuBLAS etc Notes
- BLAS is an API for a set of simple matrix and vector functions, such as multiplying a vector by a matrix.
- These functions' efficiency is important since they are the basis for widely used numerical applications.
- Indeed you usually don't call BLAS functions directly, but use higher-level packages like LAPACK that use BLAS.
- There are many implementations, free and commercial, of BLAS.
- cuBLAS is one. (A minimal calling sketch appears after this list.)
- One reason that Fortran is still used is that, in the past, it was easier to write efficient Fortran programs than C or C++ programs for these applications.
- There are other, very efficient, C++ numerical packages. (I can list some, if there's interest).
- Their efficiency often comes from aggressively using C++ templates.
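For concreteness, a minimal sketch of my own of one BLAS level-2 call, y = alpha*A*x + beta*y, through cuBLAS; the matrix here is a 3x3 identity, so the result just echoes x.

// Compile: nvcc -arch=sm_35 blas_sketch.cu -lcublas
#include <cublas_v2.h>
#include <cstdio>

int main() {
    const int n = 3;
    float A[n * n] = {1,0,0, 0,1,0, 0,0,1};        // column-major identity
    float x[n] = {1, 2, 3}, y[n] = {0, 0, 0};
    float *dA, *dx, *dy;
    cudaMalloc((void **)&dA, sizeof A);
    cudaMalloc((void **)&dx, sizeof x);
    cudaMalloc((void **)&dy, sizeof y);
    cudaMemcpy(dA, A, sizeof A, cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x, sizeof x, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, sizeof y, cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemv(h, CUBLAS_OP_N, n, n, &alpha, dA, n, dx, 1, &beta, dy, 1);
    cudaMemcpy(y, dy, sizeof y, cudaMemcpyDeviceToHost);
    printf("%f %f %f\n", y[0], y[1], y[2]);        // 1 2 3
    cublasDestroy(h);
    return 0;
}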
Thurs 2014-04-03 (L16)
Future scheduling
- There will be no class on Mon May 5.
- Therefore class presentations will be the previous week, April 28 and May 1.
Today's student summaries of some of the best GTC 2013 talks
- Wenli Li: GPU-friendly Data Compression, Jens Schneider (King Abdullah University of Science and Technology)
- David Hedin: RocketSim: A GPU Based Simulation Accelerator for Chip-verification, Uri Tal (Rocketick Inc.)
- Dan Benedetti: GPU-Acceleration-For-Hair-Simulation-Rendering.pdf . Francesco Giordana, Sarah Macdonald, and Gianluca Vatinno at Double Negative VFX
- Salles Maghalaes: Performance-Optimization-Strategies-for-GPU-Accelerated-Apps.pdf
- Yi Deng: https://www.udacity.com/course/cs344
Week 11
Mon 2014-04-07 (L17)
More presentations from GTC presentations:
- Yan Ou: Real-time Traffic Sign Recognition on Mobile Processors
- Ruoyang Yao: CUDA-based Geant4 Monte Carlo Simulation for Radiation Therapy by N. Henderson & K. Murakami
Supercomputing videos
I'll show the start of various videos in class. If you're interested, you can watch the rest on your own.
- Jack Dongarra: Algorithmic and Software Challenges For Numerical Libraries at Exascale This was presented at the Exascale Computing in Astrophysics Conference 2013, whose Youtube channel is here .
- High Performance Computing Innovation Center at LLNL from the 2014 HPCAC Stanford HPC & Exascale Conference.
- Petros Koumoutsakos: Fast Machines and Cool Algorithms for (Exascale) Flow Simulations from the 2014 HPCAC Stanford HPC & Exascale Conference.
- Architecture-aware Algorithms for Peta and Exascale Computing - Dongarra at SC13.
- Lorena Barba Presents: Flying Snakes on GPUs at SC13.
- Joel Primack: Supercomputing the Universe,
Conference Exascale Computing in Astrophysics.
- There are many interesting talks at Nvidia's GPU technology theater at SC13.
Thurs 2014-04-10 (L18)
Matlab
- Good for applications that look like matrices. Considerable contortions required for, e.g., a general graph. You'd represent that with a large sparse adjacency matrix.
- Using explicit for loops is slow.
- Efficient execution when using builtin matrix functions, but can be difficult to write your algorithm that way, and difficult to read the code.
- Very expensive and getting more so. Many separately priced apps.
- Uses state-of-the-art numerical algorithms. E.g., to solve large sparse overdetermined linear systems. Better than Mathematica.
- Most or all such algorithms also freely available as C++ libraries. However, which library to use? Complicated calling sequences. Obscure C++ template error messages.
- Graphical output is mediocre. Mathematica is better.
- Various ways Matlab can execute in parallel
- Operations on arrays can execute in parallel. E.g. B=SIN(A) where A is a matrix.
- Automatic multithreading by some functions: various functions, like INV(a), automatically use perhaps 8 cores. The '8' is a license limitation. See "Which MATLAB functions benefit from multithreaded computation?"
- PARFOR: like FOR, but multithreaded. However, FOR is slow. Many restrictions, e.g., cannot be nested. http://www.mathworks.com/help/distcomp/introduction-to-parallel-solutions.html Start pools first with: MATLABPOOL OPEN 12. Limited to 12 threads. Can do reductions.
- Parallel Computing Server: this runs on a parallel machine, including Amazon EC2. Your client sends batch or interactive jobs to it. Many Matlab toolboxes are not licensed to use it, which makes it much less useful.
- GPU computing: create an array on the device with gpuArray, then run builtin functions on it. http://www.mathworks.com/help/distcomp/run-built-in-functions-on-a-gpu.html
Week 12
Mon 2014-04-14 (L18)
Student presentations of exciting talks, ctd
- Steve Han
Final project deliverables
- Report to be uploaded to LMS for May 5. If the file is too big, then upload a link to it.
- Talk to the class, on either April 28 or May 1. You may pick the length, but I suggest 20 minutes. A large team might want more time. Please email me your team, your desired length, and your preferred date. I will fill up the days in the order that the emails arrive.
More talks from the GPU Technology Theater at SC13
Denver CO, Nov 17-22, 2013.
http://www.nvidia.com/object/sc13-technology-theater.html#
That page has links to videos and usually also the slides.
- Applications of Programming the GPU Directly from Python Using NumbaPro. Travis Oliphant Co-Founder and CEO Continuum Analytics.
- Tomorrow's Exascale Systems: Not Just Bigger Versions of Today's Peta-Computers Thomas Sterling, Executive Associate Director & Chief Scientist, Indiana University.
Other talk
Joel Primack: Supercomputing the Universe Conference Exascale Computing in Astrophysics, 8-13 Sept 2013.
Thurs 2014-04-17 (L19)
Term project talks
Since no one emailed me with a preference, here is the project presentation schedule.
- Monday, April 28
- Dan Benedetti
- Steve Han
- David Hedin
- Thurs May 1
- Wenli Li
- Salles Maghalaes
- Yi Deng, Yan Ou, Ruoyang Yao
Mathematica in parallel.
You terminate an input command with shift-enter.
Some Mathematica commands:
Sin[1.]
Plot[Sin[x],{x,-2,2}]
a=Import["/opt/parallel/mathematica/mtn1.dat"]
Information[a]
Length[a]
b=ArrayReshape[a,{400,400}]
MatrixPlot[b]
ReliefPlot[b]
ReliefPlot[b,Method->"AspectBasedShading"]
ReliefPlot[MedianFilter[b,1]]
Dimensions[b]
Eigenvalues[b]   (* when you get bored waiting, type alt-. *)
Eigenvalues[b+0.0]
Table[{x^i y^j,x^j y^i},{i,2},{j,2}]
Flatten[Table[{x^i y^j,x^j y^i},{i,2},{j,2}],1]
StreamPlot[{x*y,x+y},{x,-3,3},{y,-3,3}]
$ProcessorCount
$ProcessorType
(* select Parallel Kernel Configuration and Status in the Evaluation menu *)
ParallelEvaluate[$ProcessID]
PrimeQ[101]
Parallelize[Table[PrimeQ[n!+1],{n,400,500}]]
merQ[n_]:=PrimeQ[2^n-1]
Select[Range[5000],merQ]
ParallelSum[Sin[x+0.],{x,0,100000000}]
Parallelize[Select[Range[5000],merQ]]
Needs["CUDALink`"]   (* note the back quote *)
CUDAInformation[]
Manipulate[n, {n, 1.1, 20.}]
Plot[Sin[x], {x, 1., 20.}]
Manipulate[Plot[Sin[x], {x, 1., n}], {n, 1.1, 20.}]
Integrate[Sin[x]^3, x]
Manipulate[Integrate[Sin[x]^n, x], {n, 0, 20}]
Manipulate[{n, FactorInteger[n]}, {n, 1, 100, 1}]
Manipulate[Plot[Sin[a x] + Sin[b x], {x, 0, 10}], {a, 1, 4}, {b, 1, 4}]
Unfortunately there's a problem that I'm still debugging with the Mathematica - CUDA interface.
Week 13
Mon 2014-04-21 (L20)
http://on-demand.gputechconf.com/gtc/2013/presentations/S3151-Emergent-Numerical-Algorithms.pdf
Thurs 2014-04-24 (L21)
Cloud computing
The material is from Wikipedia, which appeared better than any other sources that I could find.
- Hierarchy:
- IaaS (Infrastructure as a Service)
- Sample functionality: VM, storage
- Examples:
- Google_Compute_Engine
- Amazon_Web_Services
- OpenStack : compute, storage, networking, dashboard
- PaaS (Platform ...)
- Sample functionality: OS, Web server, database server
- Examples:
- OpenShift
- Cloud_Foundry
- Hadoop :
- distributed FS, Map Reduce
- derived from Google FS, map reduce
- used by Facebook etc.
- SaaS (Software ...)
- Sample functionality: email, gaming, CRM, ERP
- IaaS (Infrastructure as a Service)
- Cloud_computing_comparison
- Virtual machine
- Distributed storage
- See also
- VNC
- Grid_computing
- decentralized, heterogeneous
- used for major projects like protein folding
Week 14
Mon 2014-04-28 (L22)
Course recap
- My teaching style is to work from particulars to the general.
- You've seen OpenMP, a tool for shared memory parallelism.
- You've seen the architecture of NVidia's GPU, a widely used parallel system, and CUDA, a tool for programming it.
- You've seen Thrust, a tool on top of CUDA, built in the C++ STL style.
- You've seen how widely used numerical tools like BLAS and FFT have versions built on CUDA.
- You've seen the Matlab and Mathematica development environments, which have tools for parallel programming.
- You've had a chance to program in all of them on geoxeon, my research lab machine with dual 8-core Xeons and K20Xm and K5000 NVidia boards.
- You've seen a summary of cloud computing, which enables parallelism at the level of multiple commodity virtual machines.
- You've seen talks by leaders in high performance computing, such as Jack Dongarra.
- Now, you can inductively reason towards general design rules for shared and non-shared parallel computers, and for the SW tools to exploit them.
Term project presentations, 1
- Dan Benedetti, Parallel Project
- Steve Han
- David Hedin
Thurs 2014-05-01 (L22)
Term project presentations, 2
- Wenli Li
- Salles Maghalaes
- Yi Deng, Yan Ou, Ruoyang Yao
Week 15
Mon 2014-05-05
- No class.
- Term projects due. Submit to LMS or email to me. If the file is big, then upload it to your favorite server, or put it on geoxeon, and send me a link.
After the semester
- Before or after you leave RPI, when I'm not too busy, you're welcome to talk to me about most any legal topic.
- You're welcome to keep using your geoxeon accounts, including for your research, indefinitely. Understand that this is an unsupported research machine. If it breaks, it might not get fixed immediately. Some capabilities might never get replaced. If the disk crashes, you lose your files. If one of your accounts is used to attack geoxeon, the rules will change. Etc, etc. If geoxeon is of massive help to your research, I and my NSF grant would appreciate acknowledgements in your papers.