Jacobi
Jacobi.h File Reference

The header containing the most relevant functions for the Jacobi solver.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

Defines

#define USE_FLOAT   0
#define DEFAULT_DOMAIN_SIZE   4096
#define MIN_DOM_SIZE   3
#define ENV_LOCAL_RANK   "MV2_COMM_WORLD_LOCAL_RANK"
#define MPI_MASTER_RANK   0
#define JACOBI_TOLERANCE   1.0E-5F
#define JACOBI_MAX_LOOPS   1000
#define STATUS_OK   0
#define STATUS_ERR   -1

Functions

void Initialize (int *argc, char ***argv, int *rank, int *size)
 Initialize the MPI environment, allowing the CUDA device to be selected beforehand (if necessary).
void Finalize (real *devBlocks[2], real *devSideEdges[2], real *devHaloLines[2], real *hostSendLines[2], real *hostRecvLines[2], real *devResidue, cudaStream_t copyStream)
 Close (finalize) the MPI environment and deallocate buffers.
int ParseCommandLineArguments (int argc, char **argv, int rank, int size, int2 *domSize, int2 *topSize, int *useFastSwap)
 Parses the application's command-line arguments.
int ApplyTopology (int *rank, int size, const int2 *topSize, int *neighbors, int2 *topIndex, MPI_Comm *cartComm)
 Generates the 2D topology and establishes the neighbor relationships between MPI processes.
void InitializeDataChunk (int topSizeY, int topIdxY, const int2 *domSize, const int *neighbors, cudaStream_t *copyStream, real *devBlocks[2], real *devSideEdges[2], real *devHaloLines[2], real *hostSendLines[2], real *hostRecvLines[2], real **devResidue)
 This allocates and initializes all the relevant data buffers before the Jacobi run.
void PreRunJacobi (MPI_Comm cartComm, int rank, int size, double *timerStart)
 This function is called immediately before the main Jacobi loop.
void RunJacobi (MPI_Comm cartComm, int rank, int size, const int2 *domSize, const int2 *topIndex, const int *neighbors, int useFastSwap, real *devBlocks[2], real *devSideEdges[2], real *devHaloLines[2], real *hostSendLines[2], real *hostRecvLines[2], real *devResidue, cudaStream_t copyStream, int *iterations, double *avgTransferTime)
 This is the main Jacobi loop, which handles device computation and data exchange between MPI processes.
void PostRunJacobi (MPI_Comm cartComm, int rank, int size, const int2 *topSize, const int2 *domSize, int iterations, int useFastSwap, double timerStart, double avgTransferTime)
 This function is called immediately after the main Jacobi loop.
void SetDeviceBeforeInit ()
 This allows the MPI process to set the CUDA device before the MPI environment is initialized. For the CUDA-aware MPI version, this is the only place where the device gets set. In order to do this, we rely on the node's local rank, as the MPI environment has not been initialized yet.
void SetDeviceAfterInit (int rank)
 This allows the MPI process to set the CUDA device after the MPI environment is initialized. For the CUDA-aware MPI version, there is nothing to be done here.
void ExchangeHalos (MPI_Comm cartComm, real *devSend, real *hostSend, real *hostRecv, real *devRecv, int neighbor, int elemCount)
 Exchange halo values between 2 direct neighbors. This is the main difference between the normal CUDA & MPI version and the CUDA-aware MPI version: in the former, the exchange first requires a copy from device to host memory, then an MPI call using the host buffers, and lastly a copy of the received host buffer back to device memory; in the latter, the host buffers are skipped entirely, as the MPI environment uses the device buffers directly.
void CheckCudaCall (cudaError_t command, const char *commandName, const char *fileName, int line)
 The host function for checking the result of a CUDA API call.
real CallJacobiKernel (real *devBlocks[2], real *devResidue, const int4 *bounds, const int2 *size)
 The host wrapper for one Jacobi iteration.
void CopyDeviceBlock (real *devBlocks[2], const int4 *bounds, const int2 *size)
 The host wrapper for copying the updated block over the old one, after a Jacobi iteration finishes.
void CopyDevHalosToBlock (real *devBlock, const real *devHaloLineLeft, const real *devHaloLineRight, const int2 *size, const int *neighbors)
 The host wrapper for copying (unpacking) the values from the halo buffers to the left and right side of the data block.
void CopyDevSideEdgesFromBlock (const real *devBlock, real *devSideEdges[2], const int2 *size, const int *neighbors, cudaStream_t copyStream)
 The host wrapper for copying (packing) the values on the left and right side of the data block to separate, contiguous buffers.

Detailed Description

The header containing the most relevant functions for the Jacobi solver.


Define Documentation

#define DEFAULT_DOMAIN_SIZE   4096

This is the default domain size (when not explicitly stated with "-d" in the command-line arguments).

#define ENV_LOCAL_RANK   "MV2_COMM_WORLD_LOCAL_RANK"

This is the environment variable which allows the reading of the local rank of the current MPI process before the MPI environment gets initialized with MPI_Init(). This is necessary when running the CUDA-aware MPI version of the Jacobi solver, which needs this information in order to be able to set the CUDA device for the MPI process before MPI environment initialization. If you are using MVAPICH2, set this constant to "MV2_COMM_WORLD_LOCAL_RANK"; for Open MPI, use "OMPI_COMM_WORLD_LOCAL_RANK".
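
The snippet below is a minimal sketch of how SetDeviceBeforeInit() could consume this variable; the helper name, the round-robin rank-to-device mapping, and the silent fallback when the variable is unset are assumptions for illustration, not necessarily what the solver does.

#include <stdlib.h>
#include <cuda_runtime.h>

#define ENV_LOCAL_RANK "MV2_COMM_WORLD_LOCAL_RANK"   /* mirrors the define documented above */

void SetDeviceBeforeInitSketch(void)
{
    const char * localRankStr = getenv(ENV_LOCAL_RANK);
    int devCount = 0;

    /* The variable is only present when the process was launched through the MPI job launcher. */
    if (localRankStr != NULL)
    {
        int localRank = atoi(localRankStr);

        cudaGetDeviceCount(&devCount);
        cudaSetDevice(localRank % devCount);    /* assumed round-robin mapping of local ranks to devices */
    }
}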

#define JACOBI_MAX_LOOPS   1000

This is the Jacobi iteration count limit. The Jacobi run will never cycle more than this, even if it has not converged when finishing the last allowed iteration.

#define JACOBI_TOLERANCE   1.0E-5F

This is the Jacobi tolerance threshold. The run is considered to have converged when the maximum residue falls below this value.
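
Taken together, the two limits bound every run: the solver stops at whichever comes first, convergence below JACOBI_TOLERANCE or JACOBI_MAX_LOOPS iterations. The self-contained snippet below illustrates that stopping rule only; the residue update is a stand-in, not the solver's actual computation.

#include <stdio.h>

#define JACOBI_TOLERANCE  1.0E-5F
#define JACOBI_MAX_LOOPS  1000

int main(void)
{
    double residue = 1.0;    /* stand-in for the maximum residue over the domain */
    int iteration  = 0;

    while ((iteration < JACOBI_MAX_LOOPS) && (residue > JACOBI_TOLERANCE))
    {
        residue *= 0.5;      /* stand-in for one Jacobi sweep */
        ++iteration;
    }

    printf("Stopped after %d iterations, residue = %e\n", iteration, residue);
    return 0;
}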

#define MIN_DOM_SIZE   3

This is the minimum acceptable domain size in any of the 2 dimensions.

#define MPI_MASTER_RANK   0

This is the global rank of the root (master) process; in this application, it is mostly relevant for printing purposes.

#define STATUS_ERR   -1

This is the status value that indicates an error.

#define STATUS_OK   0

This is the status value that indicates a successful operation.

#define USE_FLOAT   0

Setting this to 1 makes the application use only single-precision floating-point data. Set this to 0 in order to use double-precision floating-point data instead.
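
The functions in this header are declared in terms of a real type that is presumably selected by this switch; a plausible definition (an assumption, since the actual typedef lives elsewhere in the project) would be:

#if USE_FLOAT
    typedef float  real;    /* single-precision build */
#else
    typedef double real;    /* double-precision build (the default, USE_FLOAT == 0) */
#endif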


Function Documentation

int ApplyTopology ( int *  rank,
int  size,
const int2 *  topSize,
int *  neighbors,
int2 *  topIndex,
MPI_Comm *  cartComm 
)

Generates the 2D topology and establishes the neighbor relationships between MPI processes.

Parameters:
  [in,out] rank       The rank of the calling MPI process
  [in]     size       The total number of MPI processes available
  [in]     topSize    The desired topology size (this must match the number of available MPI processes)
  [out]    neighbors  The list that will be populated with the direct neighbors of the calling MPI process
  [out]    topIndex   The 2D index that the calling MPI process will have in the topology
  [out]    cartComm   The Cartesian MPI communicator

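As a rough illustration of the steps involved, the sketch below builds a 2D topology with MPI's Cartesian facilities. The helper name, the use of plain int arrays instead of int2, the non-periodic boundaries, the reorder flag, and the neighbor ordering are all assumptions for illustration, not the solver's actual implementation.

#include <mpi.h>

/* Hypothetical neighbor ordering used only in this sketch. */
enum { NEIGHBOR_LEFT, NEIGHBOR_RIGHT, NEIGHBOR_TOP, NEIGHBOR_BOTTOM };

int ApplyTopologySketch(int * rank, int size, const int topSize[2],
                        int neighbors[4], int topIndex[2], MPI_Comm * cartComm)
{
    int dims[2]    = { topSize[0], topSize[1] };
    int periods[2] = { 0, 0 };                   /* non-periodic domain (assumption) */

    if (dims[0] * dims[1] != size)
        return -1;                               /* the topology must match the process count */

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, cartComm);
    MPI_Comm_rank(*cartComm, rank);              /* the rank may change if reordering occurred */
    MPI_Cart_coords(*cartComm, *rank, 2, topIndex);

    /* Direct neighbors in each dimension; boundaries get MPI_PROC_NULL. */
    MPI_Cart_shift(*cartComm, 0, 1, &neighbors[NEIGHBOR_LEFT], &neighbors[NEIGHBOR_RIGHT]);
    MPI_Cart_shift(*cartComm, 1, 1, &neighbors[NEIGHBOR_TOP],  &neighbors[NEIGHBOR_BOTTOM]);

    return 0;
}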

real CallJacobiKernel ( real *  devBlocks[2],
real *  devResidue,
const int4 *  bounds,
const int2 *  size 
)

The host wrapper for one Jacobi iteration.

Parameters:
  [in,out] devBlocks   The 2 blocks involved: the first is the current one, the second is the one to be updated
  [out]    devResidue  The global residue that is to be updated through the Jacobi iteration
  [in]     bounds      The bounds of the rectangular block region that holds only computable values
  [in]     size        The 2D size of data blocks, excluding the edges which hold the halo values

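For reference, the device-side update being wrapped is the classic 5-point Jacobi stencil. The kernel below is a simplified sketch under an assumed row-major layout, with the block's outermost rows and columns holding boundary or halo values; the per-block residue reduction performed by the real kernel is omitted.

#include <cuda_runtime.h>

typedef double real;    /* matches USE_FLOAT == 0 */

__global__ void JacobiSweepSketch(const real * oldBlock, real * newBlock,
                                  int width, int height)
{
    /* Interior points only; row 0, row height-1, column 0 and column width-1 are never written here. */
    int x = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int y = blockIdx.y * blockDim.y + threadIdx.y + 1;

    if ((x < width - 1) && (y < height - 1))
    {
        int idx = y * width + x;

        /* The new value is the average of the four direct neighbors. */
        newBlock[idx] = (real)0.25 * (oldBlock[idx - 1]     + oldBlock[idx + 1] +
                                      oldBlock[idx - width] + oldBlock[idx + width]);

        /* The real kernel would also fold |newBlock[idx] - oldBlock[idx]| into the
           global residue; that reduction is omitted from this sketch. */
    }
}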

void CheckCudaCall ( cudaError_t  command,
const char *  commandName,
const char *  fileName,
int  line 
)

The host function for checking the result of a CUDA API call.

Parameters:
  [in] command      The result of the previously-issued CUDA API call
  [in] commandName  The name of the issued API call
  [in] fileName     The name of the file where the API call occurred
  [in] line         The line in the file where the API call occurred
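
A minimal sketch of such a checker is shown below; the message format, the wrapping macro, and the decision to abort on failure are assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

void CheckCudaCallSketch(cudaError_t command, const char * commandName,
                         const char * fileName, int line)
{
    if (command != cudaSuccess)
    {
        fprintf(stderr, "CUDA error in %s (%s:%d): %s\n",
                commandName, fileName, line, cudaGetErrorString(command));
        exit(EXIT_FAILURE);
    }
}

/* Hypothetical convenience macro that captures the call site automatically. */
#define SafeCudaCall(call) CheckCudaCallSketch((call), #call, __FILE__, __LINE__)
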
void CopyDevHalosToBlock ( real *  devBlock,
const real *  devHaloLineLeft,
const real *  devHaloLineRight,
const int2 *  size,
const int *  neighbors 
)

The host wrapper for copying (unpacking) the values from the halo buffers to the left and right side of the data block.

Parameters:
  [out] devBlock          The 2D device block that will contain the halo values after unpacking
  [in]  devHaloLineLeft   The halo buffer for the left side of the data block
  [in]  devHaloLineRight  The halo buffer for the right side of the data block
  [in]  size              The 2D size of the data block, excluding the edges which hold the halo values
  [in]  neighbors         The ranks of the neighboring MPI processes


void CopyDeviceBlock ( real *  devBlocks[2],
const int4 *  bounds,
const int2 *  size 
)

The host wrapper for copying the updated block over the old one, after a Jacobi iteration finishes.

Parameters:
  [in,out] devBlocks  The 2 blocks involved: the first is the old one, the second is the updated one
  [in]     bounds     The bounds of the rectangular updated region (holding only computable values)
  [in]     size       The 2D size of data blocks, excluding the edges which hold the halo values


void CopyDevSideEdgesFromBlock ( const real *  devBlock,
real *  devSideEdges[2],
const int2 *  size,
const int *  neighbors,
cudaStream_t  copyStream 
)

The host wrapper for copying (packing) the values on the left and right side of the data block to separate, contiguous buffers.

Parameters:
  [in]  devBlock      The 2D device block containing the updated values after a Jacobi run
  [out] devSideEdges  The buffers where the edge values will be packed
  [in]  size          The 2D size of the data block, excluding the edges which hold the halo values for the next iteration
  [in]  neighbors     The ranks of the neighboring MPI processes
  [in]  copyStream    The stream on which this kernel will be executed

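The packing step it wraps amounts to gathering one strided column of the block into a contiguous line that can be handed to the halo exchange. The kernel below sketches that idea under an assumed row-major layout; the name and launch configuration are hypothetical.

typedef double real;    /* matches USE_FLOAT == 0 */

__global__ void PackColumnSketch(const real * block, real * edgeBuffer,
                                 int width, int height, int col)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < height)
    {
        /* Column elements are width apart in memory; the packed buffer is contiguous. */
        edgeBuffer[row] = block[row * width + col];
    }
}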

void ExchangeHalos ( MPI_Comm  cartComm,
real *  devSend,
real *  hostSend,
real *  hostRecv,
real *  devRecv,
int  neighbor,
int  elemCount 
)

Exchange halo values between 2 direct neighbors. This is the main difference between the normal CUDA & MPI version and the CUDA-aware MPI version: in the former, the exchange first requires a copy from device to host memory, then an MPI call using the host buffers, and lastly a copy of the received host buffer back to device memory; in the latter, the host buffers are skipped entirely, as the MPI environment uses the device buffers directly. A sketch of both variants follows the parameter lists below.

Parameters (CUDA-aware MPI version):
  [in] cartComm   The Cartesian MPI communicator
  [in] devSend    The device buffer that needs to be sent
  [in] hostSend   The host buffer where the device buffer is first copied to (not needed here)
  [in] hostRecv   The host buffer that receives the halo values (not needed here)
  [in] devRecv    The device buffer that receives the halo values directly
  [in] neighbor   The rank of the neighbor MPI process in the Cartesian communicator
  [in] elemCount  The number of elements to transfer

Parameters (normal CUDA & MPI version):
  [in] cartComm   The Cartesian MPI communicator
  [in] devSend    The device buffer that needs to be sent
  [in] hostSend   The host buffer where the device buffer is first copied to
  [in] hostRecv   The host buffer that receives the halo values directly
  [in] devRecv    The device buffer where the receiving host buffer is copied to
  [in] neighbor   The rank of the neighbor MPI process in the Cartesian communicator
  [in] elemCount  The number of elements to transfer
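
The two sketches below illustrate the contrast described above: the staged flow corresponds to the normal CUDA & MPI version, the direct flow to the CUDA-aware one. The function names, the use of blocking MPI_Sendrecv, the message tag, and the MPI_DOUBLE datatype (i.e. USE_FLOAT == 0) are assumptions.

#include <mpi.h>
#include <cuda_runtime.h>

typedef double real;
#define HALO_TAG 0      /* hypothetical message tag */

/* Normal CUDA & MPI version: stage the halo lines through host memory. */
void ExchangeHalosStagedSketch(MPI_Comm cartComm, real * devSend, real * hostSend,
                               real * hostRecv, real * devRecv, int neighbor, int elemCount)
{
    size_t bytes = (size_t)elemCount * sizeof(real);

    cudaMemcpy(hostSend, devSend, bytes, cudaMemcpyDeviceToHost);
    MPI_Sendrecv(hostSend, elemCount, MPI_DOUBLE, neighbor, HALO_TAG,
                 hostRecv, elemCount, MPI_DOUBLE, neighbor, HALO_TAG,
                 cartComm, MPI_STATUS_IGNORE);
    cudaMemcpy(devRecv, hostRecv, bytes, cudaMemcpyHostToDevice);
}

/* CUDA-aware MPI version: the device pointers go straight to MPI. */
void ExchangeHalosCudaAwareSketch(MPI_Comm cartComm, real * devSend, real * devRecv,
                                  int neighbor, int elemCount)
{
    MPI_Sendrecv(devSend, elemCount, MPI_DOUBLE, neighbor, HALO_TAG,
                 devRecv, elemCount, MPI_DOUBLE, neighbor, HALO_TAG,
                 cartComm, MPI_STATUS_IGNORE);
}
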
void Finalize ( real *  devBlocks[2],
real *  devSideEdges[2],
real *  devHaloLines[2],
real *  hostSendLines[2],
real *  hostRecvLines[2],
real *  devResidue,
cudaStream_t  copyStream 
)

Close (finalize) the MPI environment and deallocate buffers.

Parameters:
  [in] devBlocks      The 2 device blocks that were used during the Jacobi iterations
  [in] devSideEdges   The 2 device side edges that were used to hold updated halos before sending
  [in] devHaloLines   The 2 device lines that were used to hold received halos
  [in] hostSendLines  The 2 host send buffers that were used at halo exchange in the normal CUDA & MPI version
  [in] hostRecvLines  The 2 host receive buffers that were used at halo exchange in the normal CUDA & MPI version
  [in] devResidue     The global residue, kept in device memory
  [in] copyStream     The stream used to overlap top & bottom halo exchange with side halo copy to host memory
void Initialize ( int *  argc,
char ***  argv,
int *  rank,
int *  size 
)

Initialize the MPI environment, allowing the CUDA device to be selected beforehand (if necessary).

Parameters:
  [in,out] argc  The number of command-line arguments
  [in,out] argv  The list of command-line arguments
  [out]    rank  The global rank of the current MPI process
  [out]    size  The total number of MPI processes available

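A plausible shape for this wrapper, inferred from the descriptions of SetDeviceBeforeInit() and SetDeviceAfterInit() in this header, is sketched below; the exact ordering and the sketch name are assumptions.

#include <mpi.h>
#include "Jacobi.h"     /* for SetDeviceBeforeInit() / SetDeviceAfterInit() */

void InitializeSketch(int * argc, char *** argv, int * rank, int * size)
{
    SetDeviceBeforeInit();                  /* device selection for the CUDA-aware MPI build */

    MPI_Init(argc, argv);
    MPI_Comm_rank(MPI_COMM_WORLD, rank);
    MPI_Comm_size(MPI_COMM_WORLD, size);

    SetDeviceAfterInit(*rank);              /* no-op in the CUDA-aware MPI build */
}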

void InitializeDataChunk ( int  topSizeY,
int  topIdxY,
const int2 *  domSize,
const int *  neighbors,
cudaStream_t *  copyStream,
real *  devBlocks[2],
real *  devSideEdges[2],
real *  devHaloLines[2],
real *  hostSendLines[2],
real *  hostRecvLines[2],
real **  devResidue 
)

This allocates and initializes all the relevant data buffers before the Jacobi run.

Parameters:
  [in]  topSizeY       The size of the topology in the Y direction
  [in]  topIdxY        The Y index of the calling MPI process in the topology
  [in]  domSize        The size of the local domain (for which only the current MPI process is responsible)
  [in]  neighbors      The neighbor ranks, according to the topology
  [in]  copyStream     The stream used to overlap top & bottom halo exchange with side halo copy to host memory
  [out] devBlocks      The 2 device blocks that will be updated during the Jacobi run
  [out] devSideEdges   The 2 side edges (parallel to the Y direction) that will hold the packed halo values before sending them
  [out] devHaloLines   The 2 halo lines (parallel to the Y direction) that will hold the packed halo values after receiving them
  [out] hostSendLines  The 2 host send buffers that will be used during the halo exchange by the normal CUDA & MPI version
  [out] hostRecvLines  The 2 host receive buffers that will be used during the halo exchange by the normal CUDA & MPI version
  [out] devResidue     The global device residue, which will be updated after every Jacobi iteration
int ParseCommandLineArguments ( int  argc,
char **  argv,
int  rank,
int  size,
int2 *  domSize,
int2 *  topSize,
int *  useFastSwap 
)

Parses the application's command-line arguments.

Parameters:
  [in]  argc         The number of input arguments
  [in]  argv         The input arguments
  [in]  rank         The MPI rank of the calling process
  [in]  size         The total number of MPI processes available
  [out] domSize      The parsed domain size (2D)
  [out] topSize      The parsed topology size (2D)
  [out] useFastSwap  The parsed flag for fast block swap
Returns:
  The parsing status (STATUS_OK indicates a successful parse)
void PostRunJacobi ( MPI_Comm  cartComm,
int  rank,
int  size,
const int2 *  topSize,
const int2 *  domSize,
int  iterations,
int  useFastSwap,
double  timerStart,
double  avgTransferTime 
)

This function is called immediately after the main Jacobi loop.

Parameters:
  [in] cartComm         The Cartesian communicator
  [in] rank             The rank of the calling MPI process
  [in] size             The total number of MPI processes available
  [in] topSize          The size of the topology
  [in] domSize          The size of the local domain
  [in] iterations       The number of successfully completed Jacobi iterations
  [in] useFastSwap      The flag indicating if fast pointer swapping was used to exchange blocks
  [in] timerStart       The Jacobi loop starting moment (measured as wall-time)
  [in] avgTransferTime  The average time spent performing MPI transfers (per process)
void PreRunJacobi ( MPI_Comm  cartComm,
int  rank,
int  size,
double *  timerStart 
)

This function is called immediately before the main Jacobi loop.

Parameters:
  [in]  cartComm    The Cartesian communicator
  [in]  rank        The rank of the calling MPI process
  [in]  size        The total number of MPI processes available
  [out] timerStart  The Jacobi loop starting moment (measured as wall-time)
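
The pair PreRunJacobi()/PostRunJacobi() presumably brackets the loop with wall-clock timing; the sketch below shows one way that could look. The helper names, the barrier placement, and the reporting format are assumptions.

#include <stdio.h>
#include <mpi.h>

#define MPI_MASTER_RANK 0   /* mirrors the define documented above */

void StartTimerSketch(MPI_Comm cartComm, double * timerStart)
{
    MPI_Barrier(cartComm);              /* line all processes up before timing starts */
    *timerStart = MPI_Wtime();
}

void StopTimerSketch(MPI_Comm cartComm, int rank, double timerStart)
{
    MPI_Barrier(cartComm);              /* make sure every process has finished */

    if (rank == MPI_MASTER_RANK)
        printf("Total Jacobi run time: %.4f s\n", MPI_Wtime() - timerStart);
}
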
void RunJacobi ( MPI_Comm  cartComm,
int  rank,
int  size,
const int2 *  domSize,
const int2 *  topIndex,
const int *  neighbors,
int  useFastSwap,
real *  devBlocks[2],
real *  devSideEdges[2],
real *  devHaloLines[2],
real *  hostSendLines[2],
real *  hostRecvLines[2],
real *  devResidue,
cudaStream_t  copyStream,
int *  iterations,
double *  avgTransferTime 
)

This is the main Jacobi loop, which handles device computation and data exchange between MPI processes.

Parameters:
  [in]     cartComm         The Cartesian MPI communicator
  [in]     rank             The rank of the calling MPI process
  [in]     size             The number of available MPI processes
  [in]     domSize          The 2D size of the local domain
  [in]     topIndex         The 2D index of the calling MPI process in the topology
  [in]     neighbors        The list of ranks which are direct neighbors to the caller
  [in]     useFastSwap      This flag indicates if blocks should be swapped through pointer copy (faster) or through element-by-element copy (slower)
  [in,out] devBlocks        The 2 device blocks that are updated during the Jacobi run
  [in,out] devSideEdges     The 2 side edges (parallel to the Y direction) that hold the packed halo values before sending them
  [in,out] devHaloLines     The 2 halo lines (parallel to the Y direction) that hold the packed halo values after receiving them
  [in,out] hostSendLines    The 2 host send buffers that are used during the halo exchange by the normal CUDA & MPI version
  [in,out] hostRecvLines    The 2 host receive buffers that are used during the halo exchange by the normal CUDA & MPI version
  [in,out] devResidue       The global device residue, which gets updated after every Jacobi iteration
  [in]     copyStream       The stream used to overlap top & bottom halo exchange with side halo copy to host memory
  [out]    iterations       The number of successfully completed iterations
  [out]    avgTransferTime  The average time spent performing MPI transfers (per process)

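One detail worth illustrating is the useFastSwap flag: after each iteration the two blocks can be exchanged either by swapping the pointers (no data movement) or by copying the updated block over the old one, which is what CopyDeviceBlock() is documented to provide. The sketch below shows the pointer-swap path; the helper name and the surrounding control flow are assumptions.

typedef double real;

static void SwapBlocksSketch(real * devBlocks[2], int useFastSwap)
{
    if (useFastSwap)
    {
        /* Fast path: exchange the pointers only. */
        real * tmp   = devBlocks[0];
        devBlocks[0] = devBlocks[1];
        devBlocks[1] = tmp;
    }
    else
    {
        /* Slow path: copy the updated block over the old one on the device,
           e.g. CopyDeviceBlock(devBlocks, &bounds, &size); the call is left
           out so this sketch stays self-contained. */
    }
}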

void SetDeviceAfterInit ( int  rank)

This allows the MPI process to set the CUDA device after the MPI environment is initialized. For the CUDA-aware MPI version, there is nothing to be done here.

Parameters:
  [in] rank  The global rank of the calling MPI process
