Sparse BLAS

- Level 1: sparse dot product, vector update, and gather/scatter;
- Level 2: sparse matrix-vector multiply and triangular solve;
- Level 3: sparse matrix-dense matrix multiply and triangular solve with multiple right-hand sides.

These are the basic operations used in most of the iterative algorithms
for solving sparse linear equations and eigensystems.
The Level 2 and 3 sparse BLAS interfaces support nine sparse matrix

storage formats. The nine formats are divided into two categories:
the *point entry* formats such as compressed row and the *block entry*
formats such as block compressed row. In the block entry formats, the
sparsity structure is represented as a series of small dense blocks.
Note that the nine formats include
some surveyed in §10.1,
except JDS and SKS.

The interface design of the sparse BLAS is fundamentally
different from the dense BLAS.
Since there is no single ``best'' storage format for any sparse matrix
operation, the sparse matrix arguments to the Level 2 and 3 sparse BLAS
do not use one particular storage format. Instead, a *generic interface*
is defined for these routines, in which a sparse matrix argument is
a *handle* (integer) representing the matrix. Before calling
a sparse BLAS routine, one needs to call a creation routine to create
the sparse matrix handle from one of the nine formats.
After calling the sparse BLAS routine, one needs to call a cleanup
routine to release the resources associated with the matrix handle.
This handle-based approach allows
one to use the sparse BLAS independently of the sparse matrix storage.
The internal representation is implementation dependent and can be chosen
for best performance.

Since matrix-vector products often account for most of the time spent in iterative methods, several research efforts have tried to optimize their performance [438,439,240,239]. In matrix-vector multiplication, each matrix entry participates in only one operation, but each vector entry may participate in more than one operation. A primary goal of optimization is to reduce the amount of movement of the source vector between different levels of memory. The optimization techniques include matrix reordering, cache blocking, and register blocking. A recently developed toolbox, called SPARSITY, embraces all these techniques [240,239]. Depending on the matrix structure, the authors have observed up to threefold speedup even on uniprocessors. SPARSITY also contains mechanisms to automatically choose the best block sizes based on matrix and machine characteristics.

In many of the iterative methods discussed earlier, both the product
of a matrix and that of its transpose times a vector are needed; that
is, given an input vector we want to compute products

In the following two subsections, we will present algorithms for the storage format from §10.1, CRS.