DanBenedetti


Introduction

CUSP is a library built on top of Thrust, designed specifically for sparse linear algebra. In this presentation, I will walk through a few small examples that show CUSP's main features and how it is used. I will then show code from my research that uses CUSP extensively. Lastly, I will discuss several issues I have encountered with CUSP.


Sample Code

Attach:cusp_demo.zip

Dense Matrices

1-dense.cu

In this demo, I show how to build 1D and 2D dense matrices in CUSP and perform some simple operations on them. CUSP implements dense matrices as wrappers around Thrust vectors. A 2D matrix is stored as a single 1D array in either row-major or column-major order (specified at creation), so any Thrust operation can be applied to CUSP matrices.

Covered in this example:

  • array1d
  • array2d
  • multiply
  • transpose
  • print
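
The row-major/column-major linearization described above can be sketched in plain Python (this is an illustration of the storage rule, not CUSP code):

```python
# Row-major vs. column-major linearization of a 2D matrix,
# as used by a 2D array's underlying 1D storage.
def row_major(i, j, num_rows, num_cols):
    return i * num_cols + j

def col_major(i, j, num_rows, num_cols):
    return j * num_rows + i

# The 2x3 matrix
# [[1, 2, 3],
#  [4, 5, 6]]
# stored both ways:
row_major_storage = [1, 2, 3, 4, 5, 6]
col_major_storage = [1, 4, 2, 5, 3, 6]

assert row_major_storage[row_major(1, 2, 2, 3)] == 6
assert col_major_storage[col_major(1, 2, 2, 3)] == 6
```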


Sparse Matrices

2-sparse.cu

In this demo, I show how to construct the same 5x5 sparse matrix using all five of CUSP's sparse matrix formats (COO, CSR, DIA, ELL, and HYB). I also show how to convert between the formats and how to transfer matrices between device and host. Lastly, I show how to write a matrix to a file in the Matrix Market exchange format.

Coordinate (COO) Format - For each nonzero element in the matrix, store the row index, column index, and value. This is perhaps the simplest format to understand and the easiest to use, and it can be used for any general form of matrix construction. It does, however, have the worst performance when used in computations.
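
The COO layout can be sketched in plain Python for a small hypothetical matrix (a format illustration, not CUSP code):

```python
# COO representation of the sparse matrix
# [[10,  0,  0],
#  [ 0,  0, 20],
#  [ 0, 30,  0]]
# One (row, column, value) triple per nonzero element.
row_indices    = [0, 1, 2]
column_indices = [0, 2, 1]
values         = [10, 20, 30]

# Reconstruct the dense matrix to check the encoding.
dense = [[0] * 3 for _ in range(3)]
for r, c, v in zip(row_indices, column_indices, values):
    dense[r][c] = v

assert dense == [[10, 0, 0], [0, 0, 20], [0, 30, 0]]
```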

Compressed Sparse Row (CSR) Format - For each row, store an offset giving the number of nonzero elements that occur in all previous rows. Then, for each nonzero element, store just the column index and value. CSR is another format that can be used for any general type of sparse matrix construction. I find it the least intuitive, but it does perform slightly better than COO.
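
The row-offset scheme can be sketched on the same small matrix (again plain Python, not CUSP code):

```python
# CSR representation of the sparse matrix
# [[10,  0,  0],
#  [ 0,  0, 20],
#  [ 0, 30,  0]]
# row_offsets[i] counts the nonzeros in all rows before row i;
# the final entry is the total nonzero count.
row_offsets    = [0, 1, 2, 3]
column_indices = [0, 2, 1]
values         = [10, 20, 30]

# Reconstruct the dense matrix to check the encoding.
dense = [[0] * 3 for _ in range(3)]
for i in range(3):
    for k in range(row_offsets[i], row_offsets[i + 1]):
        dense[i][column_indices[k]] = values[k]

assert dense == [[10, 0, 0], [0, 0, 20], [0, 30, 0]]
```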

Diagonal (DIA) Format - Store the number of diagonals that contain nonzero values, then an offset from the main matrix diagonal for each of these diagonals. Lastly, for each diagonal, store the value at each row along that diagonal. This format is best suited to banded systems, and enjoys a considerable performance improvement over the COO format.
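
A plain-Python sketch of the diagonal layout for a small banded matrix (illustration only; the padding convention here is my own, not CUSP's exact one):

```python
# DIA representation of the banded matrix
# [[1, 7, 0],
#  [0, 2, 8],
#  [0, 0, 3]]
# Two nonzero diagonals: the main diagonal (offset 0) and the first
# superdiagonal (offset +1). Each diagonal stores one value per row;
# entries falling outside the matrix are padding.
PAD = None
offsets = [0, 1]
diagonals = [
    [1, 2, 3],    # offset  0: element (i, i)
    [7, 8, PAD],  # offset +1: element (i, i+1); row 2 falls outside
]

# Reconstruct the dense matrix to check the encoding.
dense = [[0] * 3 for _ in range(3)]
for d, off in enumerate(offsets):
    for i in range(3):
        j = i + off
        if 0 <= j < 3 and diagonals[d][i] is not PAD:
            dense[i][j] = diagonals[d][i]

assert dense == [[1, 7, 0], [0, 2, 8], [0, 0, 3]]
```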

ELLPACK (ELL) Format - ELLPACK was a package for solving elliptic boundary value problems, from which this format takes its name. For the ELL format, the maximum number of nonzero elements in any row is stored. Then, for each row, the column index and value are stored for each nonzero; rows with fewer than the maximum number of nonzeros are padded with an invalid index. This format works best for matrices with a similar number of nonzeros in each row, and performs quite well.
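
The padding scheme can be sketched in plain Python (illustration only; the sentinel value is my own choice, not CUSP's):

```python
# ELL representation of the matrix
# [[1, 0, 2],
#  [0, 3, 0],
#  [4, 5, 6]]
# The widest row has 3 nonzeros, so every row stores 3 slots;
# shorter rows are padded with an invalid column index.
INVALID = -1
column_indices = [
    [0, 2, INVALID],
    [1, INVALID, INVALID],
    [0, 1, 2],
]
values = [
    [1, 2, 0],
    [3, 0, 0],
    [4, 5, 6],
]

# Reconstruct the dense matrix, skipping padded slots.
dense = [[0] * 3 for _ in range(3)]
for i in range(3):
    for k in range(3):
        if column_indices[i][k] != INVALID:
            dense[i][column_indices[i][k]] = values[i][k]

assert dense == [[1, 0, 2], [0, 3, 0], [4, 5, 6]]
```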

Hybrid (HYB) Format - This format combines the ELL and COO formats. The general approach is to pick a typical number of nonzero elements per row and store those elements in the ELL portion, then store any extra elements not covered by the ELL portion in the COO portion.
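
A plain-Python sketch of the split (illustration only; how CUSP chooses the ELL width is not shown):

```python
# HYB: the "typical" two nonzeros per row go in the ELL part;
# the one extra entry in row 2 spills into the COO part.
# Matrix:
# [[1, 2, 0, 0],
#  [0, 3, 4, 0],
#  [5, 6, 7, 0]]
INVALID = -1
ell_cols = [[0, 1], [1, 2], [0, 1]]
ell_vals = [[1, 2], [3, 4], [5, 6]]
coo = [(2, 2, 7)]  # (row, col, value) overflow entries

# Reconstruct the dense matrix from both parts.
dense = [[0] * 4 for _ in range(3)]
for i in range(3):
    for k in range(2):
        if ell_cols[i][k] != INVALID:
            dense[i][ell_cols[i][k]] = ell_vals[i][k]
for r, c, v in coo:
    dense[r][c] = v

assert dense == [[1, 2, 0, 0], [0, 3, 4, 0], [5, 6, 7, 0]]
```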

CUSP converts between formats automatically, and the same is true of transfers between device and host: simply constructing a matrix as a copy of another matrix performs any necessary format conversion and device/host transfer.

Covered in this example:

  • coo_matrix
  • csr_matrix
  • dia_matrix
  • ell_matrix
  • hyb_matrix
  • matrix_market


Numerical System Solver

3-solver.cu

In this demo, I show how to use a CUSP system solver. All of the solvers approximate the solution to Ax=b, finding x without explicitly inverting A. These solvers, all Krylov subspace methods, work on both dense and sparse A matrices; there is, however, a requirement that A be square. Setup includes specifying a monitor and a preconditioner. The monitor tracks the convergence of the system, and takes a maximum number of iterations and a desired error tolerance as arguments. The preconditioner transforms the A matrix and b vector so that the solver converges in fewer iterations.
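
To make the solver-plus-monitor idea concrete, here is a minimal conjugate gradient loop in plain Python: iterate until either the residual norm drops below a tolerance or an iteration cap is hit. This is an illustration of the general technique, not CUSP's implementation.

```python
# Minimal conjugate gradient for a symmetric positive definite system,
# with the monitor's two stopping criteria: max iterations and tolerance.

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg(A, b, max_iterations=100, tolerance=1e-6):
    n = len(b)
    x = [0.0] * n
    r = b[:]          # residual b - A*x, with x = 0 initially
    p = r[:]
    rs = dot(r, r)
    for _ in range(max_iterations):       # the "monitor" loop
        if rs ** 0.5 < tolerance:         # converged early
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# A small SPD system to exercise the loop.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = cg(A, b)

# Check that A*x is close to b.
assert all(abs(ai - bi) < 1e-6 for ai, bi in zip(matvec(A, x), b))
```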

I won't go into too much detail on the various system solvers and preconditioners provided by CUSP, but I do have a few small notes on convergence and performance. The Conjugate Gradient method (CG) has trouble converging for poorly conditioned matrices; it is a simple technique, but the other techniques are either more reliable or perform better. The Biconjugate Gradient Stabilized method (BiCGStab) offers a very strong chance of converging, but often takes more iterations than CG. The Generalized Minimal Residual method (GMRES) performed best in my personal tests, and managed to converge even for my ill-conditioned A matrix. All of the preconditioners are quite memory intensive; while they improve performance considerably in some cases, it is worth considering a simple identity matrix as the preconditioner in order to conserve memory on the GPU. CUSP provides flowcharts to aid in selecting both a solver and a preconditioner, shown below.

Covered in this example:

  • krylov subspace solvers
  • preconditioners
  • monitors


Solver Selection Flowchart


Preconditioner Selection Flowchart


Research - ODETLAP

For my research with Professor Franklin, I am compressing high-dimensional data using the ODETLAP method. Without going into too much detail, ODETLAP requires constructing a large, overdetermined system of equations. For each point in a data set, an equation is generated stating that the point equals the average of its immediately adjacent neighbors. Additional points are selected as known points, and for each of these an equation is added fixing the point to its value in the original data set. The solution to this overdetermined system is then approximated with a numerical solver.
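
A toy 1D sketch of that system construction (the actual research code works on higher-dimensional grids with CUSP/Thrust; the data values and known-point choices here are hypothetical):

```python
# Build the overdetermined ODETLAP system for a tiny 1D data set:
# one averaging equation per point, plus one equation per known point.

data = [10.0, 20.0, 30.0, 40.0]   # hypothetical 1D data set
known = {0: 10.0, 3: 40.0}        # indices chosen as known points
n = len(data)

A, b = [], []

# Averaging equations: x_i - mean(neighbors of x_i) = 0.
for i in range(n):
    row = [0.0] * n
    neighbors = [j for j in (i - 1, i + 1) if 0 <= j < n]
    row[i] = 1.0
    for j in neighbors:
        row[j] = -1.0 / len(neighbors)
    A.append(row)
    b.append(0.0)

# Known-point equations: x_k = original data value.
for k, value in known.items():
    row = [0.0] * n
    row[k] = 1.0
    A.append(row)
    b.append(value)

# 4 averaging equations + 2 known-point equations over 4 unknowns:
# more equations than unknowns, i.e. overdetermined.
assert len(A) == n + len(known) and len(A) > n
```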

CUSP is used to approximate the system solution, while Thrust is used to construct the A matrix and b vector. Due to the overdetermined nature of ODETLAP, the A matrix is not square, so before the CUSP system solvers can be used the system must be converted to the normal equations: we instead approximate the solution to A'Ax = A'b.
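
Forming the normal equations turns the rectangular system into a square one, which is what the Krylov solvers require. A small plain-Python sketch (not the CUSP/Thrust code):

```python
# Form A'A and A'b for a 3x2 (overdetermined) system and verify
# that the resulting system is square.

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]   # 3 equations, 2 unknowns: not square
b = [1.0, 2.0, 3.0]

At = transpose(A)
AtA = matmul(At, A)   # 2x2: square and symmetric
Atb = matvec(At, b)   # length-2 right-hand side

assert len(AtA) == 2 and len(AtA[0]) == 2
assert AtA == [[2.0, 1.0], [1.0, 2.0]]
assert Atb == [4.0, 5.0]
```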

ODETLAP is an iterative method. On each iteration, additional known points are added to the system. This means that A and b are resized to accommodate the new equations. The system solver is then invoked again for each ODETLAP iteration.

Implementation Specifics:

  • The A matrix is built using the COO format
  • The b and x vectors are array1d containers
  • The CUSP transpose and multiply methods are used to form A'Ax = A'b
  • GMRES with restart value 20 is used as the system solver
  • An identity matrix is used as the preconditioner in order to conserve memory
  • A default monitor is used, with a max of 100 iterations and a tolerance of 1e-6


Conclusion

CUSP fulfilled the needs of my research, which was my primary reason for using it. I would, however, suggest proceeding with caution if you plan to use it. CUSP still has not had a stable release after several years of development. It is currently on version 0.4, and code that I wrote less than one year ago using version 0.2 no longer works. Their idea of deprecation is telling you, as part of the error message when the code fails to compile, that the function is deprecated. Certain functionality does not work in all scenarios: for example, the code from the dense demo works on the host, but not on the device, and I have never managed to get any function from the cusp/blas.h library to work on either host or device, using dense or sparse matrices.

I would suggest considering a CUSP alternative if you are interested in utilizing sparse matrix computation on the GPU. There is a commercial competitor known as CULA Tools (student trial available for free). I tried it briefly, and from what I have seen it works reasonably well. Alternatively, you could consider using the lower level cuSPARSE library with CUDA, which both CUSP and CULA Tools utilize for their implementations.