Jacobi
The header containing the most relevant functions for the Jacobi solver. More...
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
Defines

#define USE_FLOAT 0
#define DEFAULT_DOMAIN_SIZE 4096
#define MIN_DOM_SIZE 3
#define ENV_LOCAL_RANK "MV2_COMM_WORLD_LOCAL_RANK"
#define MPI_MASTER_RANK 0
#define JACOBI_TOLERANCE 1.0E-5F
#define JACOBI_MAX_LOOPS 1000
#define STATUS_OK 0
#define STATUS_ERR -1
Functions

void Initialize (int *argc, char ***argv, int *rank, int *size)
    Initialize the MPI environment, allowing the CUDA device to be selected beforehand (if necessary).
void Finalize (real *devBlocks[2], real *devSideEdges[2], real *devHaloLines[2], real *hostSendLines[2], real *hostRecvLines[2], real *devResidue, cudaStream_t copyStream)
    Close (finalize) the MPI environment and deallocate buffers.
int ParseCommandLineArguments (int argc, char **argv, int rank, int size, int2 *domSize, int2 *topSize, int *useFastSwap)
    Parses the application's command-line arguments.
int ApplyTopology (int *rank, int size, const int2 *topSize, int *neighbors, int2 *topIndex, MPI_Comm *cartComm)
    Generates the 2D topology and establishes the neighbor relationships between MPI processes.
void InitializeDataChunk (int topSizeY, int topIdxY, const int2 *domSize, const int *neighbors, cudaStream_t *copyStream, real *devBlocks[2], real *devSideEdges[2], real *devHaloLines[2], real *hostSendLines[2], real *hostRecvLines[2], real **devResidue)
    This allocates and initializes all the relevant data buffers before the Jacobi run.
void PreRunJacobi (MPI_Comm cartComm, int rank, int size, double *timerStart)
    This function is called immediately before the main Jacobi loop.
void RunJacobi (MPI_Comm cartComm, int rank, int size, const int2 *domSize, const int2 *topIndex, const int *neighbors, int useFastSwap, real *devBlocks[2], real *devSideEdges[2], real *devHaloLines[2], real *hostSendLines[2], real *hostRecvLines[2], real *devResidue, cudaStream_t copyStream, int *iterations, double *avgTransferTime)
    This is the main Jacobi loop, which handles device computation and data exchange between MPI processes.
void PostRunJacobi (MPI_Comm cartComm, int rank, int size, const int2 *topSize, const int2 *domSize, int iterations, int useFastSwap, double timerStart, double avgTransferTime)
    This function is called immediately after the main Jacobi loop.
void SetDeviceBeforeInit ()
    This allows the MPI process to set the CUDA device before the MPI environment is initialized. For the CUDA-aware MPI version, this is the only place where the device gets set; it relies on the node-local rank, as the MPI environment has not been initialized yet.
void SetDeviceAfterInit (int rank)
    This allows the MPI process to set the CUDA device after the MPI environment is initialized. For the CUDA-aware MPI version, there is nothing to be done here.
void ExchangeHalos (MPI_Comm cartComm, real *devSend, real *hostSend, real *hostRecv, real *devRecv, int neighbor, int elemCount)
    Exchange halo values between 2 direct neighbors. This is the main difference between the normal CUDA & MPI version and the CUDA-aware MPI version: in the former, the exchange requires a copy from device to host memory, an MPI call using the host buffers, and finally a copy of the received host buffer back to device memory; in the latter, the host buffers are skipped entirely, as the MPI environment uses the device buffers directly.
void CheckCudaCall (cudaError_t command, const char *commandName, const char *fileName, int line)
    The host function for checking the result of a CUDA API call.
real CallJacobiKernel (real *devBlocks[2], real *devResidue, const int4 *bounds, const int2 *size)
    The host wrapper for one Jacobi iteration.
void CopyDeviceBlock (real *devBlocks[2], const int4 *bounds, const int2 *size)
    The host wrapper for copying the updated block over the old one, after a Jacobi iteration finishes.
void CopyDevHalosToBlock (real *devBlock, const real *devHaloLineLeft, const real *devHaloLineRight, const int2 *size, const int *neighbors)
    The host wrapper for copying (unpacking) the values from the halo buffers to the left and right sides of the data block.
void CopyDevSideEdgesFromBlock (const real *devBlock, real *devSideEdges[2], const int2 *size, const int *neighbors, cudaStream_t copyStream)
    The host wrapper for copying (packing) the values on the left and right sides of the data block to separate, contiguous buffers.
Detailed Description

The header containing the most relevant functions for the Jacobi solver.
Define Documentation

#define DEFAULT_DOMAIN_SIZE 4096

This is the default domain size (when not explicitly stated with "-d" in the command-line arguments).
#define ENV_LOCAL_RANK "MV2_COMM_WORLD_LOCAL_RANK"

This is the environment variable from which the local rank of the current MPI process can be read before the MPI environment is initialized with MPI_Init(). This is necessary when running the CUDA-aware MPI version of the Jacobi solver, which needs this information in order to set the CUDA device for the MPI process before the MPI environment is initialized. If you are using MVAPICH2, set this constant to "MV2_COMM_WORLD_LOCAL_RANK"; for Open MPI, use "OMPI_COMM_WORLD_LOCAL_RANK".
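As a minimal sketch (not the project's exact code), the device selection performed from this variable before MPI_Init() could look as follows; the modulo mapping onto the available devices is an assumption:

    /* Illustrative sketch: pick the CUDA device from the node-local rank
     * before MPI_Init() is called. The modulo mapping is an assumption. */
    char *localRankStr = getenv(ENV_LOCAL_RANK);
    if (localRankStr != NULL)
    {
        int localRank = atoi(localRankStr);
        int deviceCount = 0;

        cudaGetDeviceCount(&deviceCount);
        cudaSetDevice(localRank % deviceCount);
    }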
#define JACOBI_MAX_LOOPS 1000

This is the Jacobi iteration count limit. The Jacobi run will never cycle more than this, even if it has not converged when finishing the last allowed iteration.

#define JACOBI_TOLERANCE 1.0E-5F

This is the Jacobi tolerance threshold. The run is considered to have converged when the maximum residue falls below this value.
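Taken together, the two constants bound the main loop roughly as in this sketch; the variable names, and the bounds and size arguments, are assumed to be in scope:

    /* Sketch of the loop bounds; "iteration" and "residue" are assumed names. */
    int iteration = 0;
    real residue = (real)(2.0 * JACOBI_TOLERANCE);

    while ((iteration < JACOBI_MAX_LOOPS) && (residue > JACOBI_TOLERANCE))
    {
        residue = CallJacobiKernel(devBlocks, devResidue, &bounds, &size);
        ++iteration;
    }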
#define MIN_DOM_SIZE 3

This is the minimum acceptable domain size in either of the 2 dimensions.

#define MPI_MASTER_RANK 0

This is the global rank of the root (master) process; in this application, it is mostly relevant for printing purposes.

#define STATUS_ERR -1

This is the status value that indicates an error.

#define STATUS_OK 0

This is the status value that indicates a successful operation.
#define USE_FLOAT 0

Setting this to 1 makes the application use only single-precision floating-point data. Set this to 0 in order to use double-precision floating-point data instead.
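A plausible definition of the "real" type that this switch controls; the MPI datatype alias MPI_CUSTOM_REAL is an assumed name, not necessarily the header's exact one:

    #if USE_FLOAT
        typedef float real;
        #define MPI_CUSTOM_REAL MPI_FLOAT
    #else
        typedef double real;
        #define MPI_CUSTOM_REAL MPI_DOUBLE
    #endif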
Function Documentation

int ApplyTopology (int *rank, int size, const int2 *topSize, int *neighbors, int2 *topIndex, MPI_Comm *cartComm)
Generates the 2D topology and establishes the neighbor relationships between MPI processes.
Parameters:
    [in,out] rank: The rank of the calling MPI process
    [in] size: The total number of MPI processes available
    [in] topSize: The desired topology size (this must match the number of available MPI processes)
    [out] neighbors: The list that will be populated with the direct neighbors of the calling MPI process
    [out] topIndex: The 2D index that the calling MPI process will have in the topology
    [out] cartComm: The Cartesian MPI communicator
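A sketch of how such a topology is typically generated with the standard MPI Cartesian routines; the actual implementation may differ in ordering and periodicity choices:

    int dims[2]    = { topSize->x, topSize->y };
    int periods[2] = { 0, 0 };   /* non-periodic domain boundaries */
    int coords[2];

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, cartComm);
    MPI_Comm_rank(*cartComm, rank);    /* the rank may change if reordering is allowed */
    MPI_Cart_coords(*cartComm, *rank, 2, coords);
    topIndex->x = coords[0];
    topIndex->y = coords[1];

    /* Direct neighbors in each dimension; boundary positions yield MPI_PROC_NULL. */
    MPI_Cart_shift(*cartComm, 0, 1, &neighbors[0], &neighbors[1]);
    MPI_Cart_shift(*cartComm, 1, 1, &neighbors[2], &neighbors[3]);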
real CallJacobiKernel (real *devBlocks[2], real *devResidue, const int4 *bounds, const int2 *size)
The host wrapper for one Jacobi iteration.
Parameters:
    [in,out] devBlocks: The 2 blocks involved: the first is the current one, the second is the one to be updated
    [out] devResidue: The global residue that is to be updated through the Jacobi iteration
    [in] bounds: The bounds of the rectangular block region that holds only computable values
    [in] size: The 2D size of the data blocks, excluding the edges which hold the halo values
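For reference, a device kernel of the kind this wrapper launches would implement the classic 4-point Jacobi stencil; this is a simplified sketch, with the reduction into devResidue only indicated in a comment:

    __global__ void JacobiKernelSketch(const real * oldBlock, real * newBlock,
                                       int width, int height)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x + 1;
        int row = blockIdx.y * blockDim.y + threadIdx.y + 1;
        int stride = width + 2;   /* row stride, including the two halo columns */

        if ((col <= width) && (row <= height))
        {
            int idx = row * stride + col;

            newBlock[idx] = (real)0.25 * (oldBlock[idx - 1] + oldBlock[idx + 1] +
                                          oldBlock[idx - stride] + oldBlock[idx + stride]);
            /* |newBlock[idx] - oldBlock[idx]| would be max-reduced into devResidue. */
        }
    }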
void CheckCudaCall (cudaError_t command, const char *commandName, const char *fileName, int line)
The host function for checking the result of a CUDA API call.
Parameters:
    [in] command: The result of the previously issued CUDA API call
    [in] commandName: The name of the issued API call
    [in] fileName: The name of the file where the API call occurred
    [in] line: The line in the file where the API call occurred
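A typical implementation, paired with a hypothetical convenience macro (the macro name is an assumption) that captures the call site automatically:

    #define SafeCudaCall(call) CheckCudaCall((call), #call, __FILE__, __LINE__)

    void CheckCudaCall(cudaError_t command, const char * commandName,
                       const char * fileName, int line)
    {
        if (command != cudaSuccess)
        {
            fprintf(stderr, "Error: CUDA result \"%s\" for call \"%s\" in file \"%s\" at line %d.\n",
                    cudaGetErrorString(command), commandName, fileName, line);
            exit(STATUS_ERR);
        }
    }

With such a macro, SafeCudaCall(cudaSetDevice(0)) would report the failing call, file, and line on error.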
void CopyDevHalosToBlock (real *devBlock, const real *devHaloLineLeft, const real *devHaloLineRight, const int2 *size, const int *neighbors)
The host wrapper for copying (unpacking) the values from the halo buffers to the left and right sides of the data block.
Parameters:
    [out] devBlock: The 2D device block that will contain the halo values after unpacking
    [in] devHaloLineLeft: The halo buffer for the left side of the data block
    [in] devHaloLineRight: The halo buffer for the right side of the data block
    [in] size: The 2D size of the data block, excluding the edges which hold the halo values
    [in] neighbors: The ranks of the neighboring MPI processes
void CopyDeviceBlock (real *devBlocks[2], const int4 *bounds, const int2 *size)
The host wrapper for copying the updated block over the old one, after a Jacobi iteration finishes.
Parameters:
    [in,out] devBlocks: The 2 blocks involved: the first is the old one, the second is the updated one
    [in] bounds: The bounds of the rectangular updated region (holding only computable values)
    [in] size: The 2D size of the data blocks, excluding the edges which hold the halo values
void CopyDevSideEdgesFromBlock (const real *devBlock, real *devSideEdges[2], const int2 *size, const int *neighbors, cudaStream_t copyStream)
The host wrapper for copying (packing) the values on the left and right side of the data block to separate, contiguous buffers.
Parameters:
    [in] devBlock: The 2D device block containing the updated values after a Jacobi run
    [out] devSideEdges: The buffers into which the edge values will be packed
    [in] size: The 2D size of the data block, excluding the edges which hold the halo values for the next iteration
    [in] neighbors: The ranks of the neighboring MPI processes
    [in] copyStream: The stream on which this kernel will be executed
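A sketch of the kind of packing kernel this wrapper might launch on copyStream: each thread copies one element of a boundary column into the contiguous edge buffer. The indexing assumes one halo column on each side of the block:

    __global__ void PackColumnSketch(const real * devBlock, real * devSideEdge,
                                     int width, int height, int col)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;

        if (row < height)
        {
            /* width + 2 accounts for the halo columns on either side. */
            devSideEdge[row] = devBlock[(row + 1) * (width + 2) + col];
        }
    }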
void ExchangeHalos (MPI_Comm cartComm, real *devSend, real *hostSend, real *hostRecv, real *devRecv, int neighbor, int elemCount)
Exchange halo values between 2 direct neighbors. This is the main difference between the normal CUDA & MPI version and the CUDA-aware MPI version: in the former, the exchange requires a copy from device to host memory, an MPI call using the host buffers, and finally a copy of the received host buffer back to device memory; in the latter, the host buffers are skipped entirely, as the MPI environment uses the device buffers directly.
Parameters (CUDA-aware MPI version):
    [in] cartComm: The Cartesian MPI communicator
    [in] devSend: The device buffer that needs to be sent
    [in] hostSend: The host buffer to which the device buffer would first be copied (not needed here)
    [in] hostRecv: The host buffer that would receive the halo values (not needed here)
    [in] devRecv: The device buffer that receives the halo values directly
    [in] neighbor: The rank of the neighbor MPI process in the Cartesian communicator
    [in] elemCount: The number of elements to transfer

Parameters (normal CUDA & MPI version):
    [in] cartComm: The Cartesian MPI communicator
    [in] devSend: The device buffer that needs to be sent
    [in] hostSend: The host buffer to which the device buffer is first copied
    [in] hostRecv: The host buffer that receives the halo values
    [in] devRecv: The device buffer to which the receiving host buffer is copied
    [in] neighbor: The rank of the neighbor MPI process in the Cartesian communicator
    [in] elemCount: The number of elements to transfer
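The two exchange strategies described above, sketched side by side; MPI_CUSTOM_REAL is the assumed MPI alias for "real", and tags and error handling are omitted:

    /* Normal CUDA & MPI version: stage through the host buffers. */
    cudaMemcpy(hostSend, devSend, elemCount * sizeof(real), cudaMemcpyDeviceToHost);
    MPI_Sendrecv(hostSend, elemCount, MPI_CUSTOM_REAL, neighbor, 0,
                 hostRecv, elemCount, MPI_CUSTOM_REAL, neighbor, 0,
                 cartComm, MPI_STATUS_IGNORE);
    cudaMemcpy(devRecv, hostRecv, elemCount * sizeof(real), cudaMemcpyHostToDevice);

    /* CUDA-aware MPI version: hand the device pointers straight to MPI. */
    MPI_Sendrecv(devSend, elemCount, MPI_CUSTOM_REAL, neighbor, 0,
                 devRecv, elemCount, MPI_CUSTOM_REAL, neighbor, 0,
                 cartComm, MPI_STATUS_IGNORE);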
void Finalize (real *devBlocks[2], real *devSideEdges[2], real *devHaloLines[2], real *hostSendLines[2], real *hostRecvLines[2], real *devResidue, cudaStream_t copyStream)
Close (finalize) the MPI environment and deallocate buffers.
Parameters:
    [in] devBlocks: The 2 device blocks that were used during the Jacobi iterations
    [in] devSideEdges: The 2 device side edges that were used to hold updated halos before sending
    [in] devHaloLines: The 2 device lines that were used to hold received halos
    [in] hostSendLines: The 2 host send buffers that were used during the halo exchange in the normal CUDA & MPI version
    [in] hostRecvLines: The 2 host receive buffers that were used during the halo exchange in the normal CUDA & MPI version
    [in] devResidue: The global residue, kept in device memory
    [in] copyStream: The stream used to overlap top & bottom halo exchange with side halo copy to host memory
void Initialize (int *argc, char ***argv, int *rank, int *size)
Initialize the MPI environment, allowing the CUDA device to be selected beforehand (if necessary).
Parameters:
    [in,out] argc: The number of command-line arguments
    [in,out] argv: The list of command-line arguments
    [out] rank: The global rank of the current MPI process
    [out] size: The total number of MPI processes available
void InitializeDataChunk (int topSizeY, int topIdxY, const int2 *domSize, const int *neighbors, cudaStream_t *copyStream, real *devBlocks[2], real *devSideEdges[2], real *devHaloLines[2], real *hostSendLines[2], real *hostRecvLines[2], real **devResidue)
This allocates and initializes all the relevant data buffers before the Jacobi run.
Parameters:
    [in] topSizeY: The size of the topology in the Y direction
    [in] topIdxY: The Y index of the calling MPI process in the topology
    [in] domSize: The size of the local domain (for which only the current MPI process is responsible)
    [in] neighbors: The neighbor ranks, according to the topology
    [in] copyStream: The stream used to overlap top & bottom halo exchange with side halo copy to host memory
    [out] devBlocks: The 2 device blocks that will be updated during the Jacobi run
    [out] devSideEdges: The 2 side edges (parallel to the Y direction) that will hold the packed halo values before sending them
    [out] devHaloLines: The 2 halo lines (parallel to the Y direction) that will hold the packed halo values after receiving them
    [out] hostSendLines: The 2 host send buffers that will be used during the halo exchange by the normal CUDA & MPI version
    [out] hostRecvLines: The 2 host receive buffers that will be used during the halo exchange by the normal CUDA & MPI version
    [out] devResidue: The global device residue, which will be updated after every Jacobi iteration
int ParseCommandLineArguments (int argc, char **argv, int rank, int size, int2 *domSize, int2 *topSize, int *useFastSwap)
Parses the application's command-line arguments.
Parameters:
    [in] argc: The number of input arguments
    [in] argv: The input arguments
    [in] rank: The MPI rank of the calling process
    [in] size: The total number of MPI processes available
    [out] domSize: The parsed domain size (2D)
    [out] topSize: The parsed topology size (2D)
    [out] useFastSwap: The parsed flag for fast block swap
void PostRunJacobi (MPI_Comm cartComm, int rank, int size, const int2 *topSize, const int2 *domSize, int iterations, int useFastSwap, double timerStart, double avgTransferTime)
This function is called immediately after the main Jacobi loop.
Parameters:
    [in] cartComm: The Cartesian communicator
    [in] rank: The rank of the calling MPI process
    [in] size: The total number of MPI processes available
    [in] topSize: The size of the topology
    [in] domSize: The size of the local domain
    [in] iterations: The number of successfully completed Jacobi iterations
    [in] useFastSwap: The flag indicating if fast pointer swapping was used to exchange blocks
    [in] timerStart: The Jacobi loop starting moment (measured as wall-time)
    [in] avgTransferTime: The average time spent performing MPI transfers (per process)
void PreRunJacobi (MPI_Comm cartComm, int rank, int size, double *timerStart)
This function is called immediately before the main Jacobi loop.
Parameters:
    [in] cartComm: The Cartesian communicator
    [in] rank: The rank of the calling MPI process
    [in] size: The total number of MPI processes available
    [out] timerStart: The Jacobi loop starting moment (measured as wall-time)
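A sketch of the wall-time bookkeeping implied by timerStart, assuming MPI_Wtime is the clock; PreRunJacobi records the start and PostRunJacobi derives the elapsed time:

    /* In PreRunJacobi: align all processes, then record the start time. */
    MPI_Barrier(cartComm);
    *timerStart = MPI_Wtime();

    /* In PostRunJacobi: compute and report the elapsed wall-time. */
    double elapsed = MPI_Wtime() - timerStart;
    if (rank == MPI_MASTER_RANK)
    {
        printf("Total Jacobi run time: %.4lf seconds.\n", elapsed);
    }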
void RunJacobi (MPI_Comm cartComm, int rank, int size, const int2 *domSize, const int2 *topIndex, const int *neighbors, int useFastSwap, real *devBlocks[2], real *devSideEdges[2], real *devHaloLines[2], real *hostSendLines[2], real *hostRecvLines[2], real *devResidue, cudaStream_t copyStream, int *iterations, double *avgTransferTime)
This is the main Jacobi loop, which handles device computation and data exchange between MPI processes.
Parameters:
    [in] cartComm: The Cartesian MPI communicator
    [in] rank: The rank of the calling MPI process
    [in] size: The number of available MPI processes
    [in] domSize: The 2D size of the local domain
    [in] topIndex: The 2D index of the calling MPI process in the topology
    [in] neighbors: The list of ranks which are direct neighbors to the caller
    [in] useFastSwap: The flag indicating if blocks should be swapped through pointer copy (faster) or through element-by-element copy (slower)
    [in,out] devBlocks: The 2 device blocks that are updated during the Jacobi run
    [in,out] devSideEdges: The 2 side edges (parallel to the Y direction) that hold the packed halo values before sending them
    [in,out] devHaloLines: The 2 halo lines (parallel to the Y direction) that hold the packed halo values after receiving them
    [in,out] hostSendLines: The 2 host send buffers that are used during the halo exchange by the normal CUDA & MPI version
    [in,out] hostRecvLines: The 2 host receive buffers that are used during the halo exchange by the normal CUDA & MPI version
    [in,out] devResidue: The global device residue, which gets updated after every Jacobi iteration
    [in] copyStream: The stream used to overlap top & bottom halo exchange with side halo copy to host memory
    [out] iterations: The number of successfully completed iterations
    [out] avgTransferTime: The average time spent performing MPI transfers (per process)
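A simplified sketch of one iteration of this loop, using the documented helpers; the per-neighbor exchange loop and the overlap of the side-edge copy with the top & bottom exchange are condensed, and names such as iter, bounds, and globalResidue are assumptions:

    while ((iter < JACOBI_MAX_LOOPS) && (globalResidue > JACOBI_TOLERANCE))
    {
        /* Pack the left and right columns into contiguous buffers on copyStream. */
        CopyDevSideEdgesFromBlock(devBlocks[0], devSideEdges, domSize, neighbors, copyStream);

        /* Exchange halos with each direct neighbor (one side shown here). */
        ExchangeHalos(cartComm, devSideEdges[0], hostSendLines[0], hostRecvLines[0],
                      devHaloLines[0], neighbors[0], domSize->y);

        /* Unpack the received halos into the block edges. */
        CopyDevHalosToBlock(devBlocks[0], devHaloLines[0], devHaloLines[1], domSize, neighbors);

        /* One Jacobi sweep; the wrapper returns the local residue. */
        real localResidue = CallJacobiKernel(devBlocks, devResidue, &bounds, domSize);

        /* Reduce to the global maximum residue across all processes. */
        MPI_Allreduce(&localResidue, &globalResidue, 1, MPI_CUSTOM_REAL, MPI_MAX, cartComm);

        /* Swap blocks: pointer swap (fast) or element-by-element copy (slow). */
        if (useFastSwap)
        {
            real * tmp = devBlocks[0]; devBlocks[0] = devBlocks[1]; devBlocks[1] = tmp;
        }
        else
        {
            CopyDeviceBlock(devBlocks, &bounds, domSize);
        }

        ++iter;
    }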
void SetDeviceAfterInit (int rank)
This allows the MPI process to set the CUDA device after the MPI environment is initialized. For the CUDA-aware MPI version, there is nothing to be done here.
Parameters:
    [in] rank: The global rank of the calling MPI process
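For the normal CUDA & MPI version, a simple rank-to-device mapping is one plausible implementation (the modulo mapping is an assumption, not the project's confirmed code):

    void SetDeviceAfterInit(int rank)
    {
        int deviceCount = 0;

        CheckCudaCall(cudaGetDeviceCount(&deviceCount), "cudaGetDeviceCount", __FILE__, __LINE__);
        CheckCudaCall(cudaSetDevice(rank % deviceCount), "cudaSetDevice", __FILE__, __LINE__);
    }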