Linear Solvers

Krylov Solvers and Preconditioning with MueLu/Trilinos

Introduction to Krylov Solvers and Preconditioning, with emphasis on Multigrid

At a Glance

Questions	Objectives	Key Points
How do we choose a suitable Krylov solver?	Know when to use CG or GMRES.	CG works for spd matrix and preconditioner. GMRES works for unsymmetric systems, but requires more memory.
How do we choose a preconditioner?	Know common sparse preconditioners.	As the size of the linear system grows, most iterative methods will require increasing number of iterations.
How can we improve efficiency of the solver?	Understand the basic components of multigrid.	For certain common problem classes, multigrid methods require a constant number of iterations and constant work per unknown.

To begin this lesson

Make sure that you have followed the setup instructions.

Go to the directory for the Krylov application

cd HandsOnLessons/krylov_amg_muelu

The Problem Being Solved

The Poisson equation arises in electrostatics, incompressible fluid flow simulations, and numerous other applications. We will consider the Poisson equation \[-\Delta u = f\]

on a square mesh of size \(n_x \times n_y\) with Dirichlet boundary conditions \(u = 0\).

It is discretized using central finite differences, leading to a symmetric positive (spd) matrix.

The Example Source Code

For this lesson, we will be using the executable MueLu_Stratimikos.exe from the MueLu package of Trilinos which allows us to test a variety of solvers and preconditioners.

For the first part of the lesson, we will be running on a single MPI rank, so no need to ask for a massive allocation.

The executable takes several command line arguments that influence the linear system that is generated on the fly or read in from file. A complete set of options will be printed by typing

./MueLu_Stratimikos.exe --help

The most important ones are:

Usage: ./MueLu_Stratimikos.exe [options]
  options:
  --help                               Prints this help message
  --nx                   int           mesh points in x-direction.
                                       (default: --nx=100)
  --ny                   int           mesh points in y-direction.
                                       (default: --ny=100)
  --nz                   int           mesh points in z-direction.
                                       (default: --nz=100)
  --matrixType           string        matrix type: Laplace1D, Laplace2D, Laplace3D, ...
                                       (default: --matrixType="Laplace2D")
  --xml                  string        read parameters from an xml file
                                       (default: --xml="stratimikos_ParameterList.xml")
  --yaml                 string        read parameters from a yaml file
                                       (default: --yaml="")

Solvers (such as CG and GMRES) and preconditioners (such as Jacobi, Gauss-Seidel and multigrid) are configured via parameter files.

Trilinos supports both XML and YAML parameter files.

In order to make following along as simple as possible and avoid typos, parameter XML files for all examples can be found in the lesson folder. If you want to see the difference between to input files, run diff a.xml b.xml.

Running the Example

Set 1 - Krylov solver, no preconditioner

The default Krylov method is GMRES, and no preconditioner is used.

Run

./MueLu_Stratimikos.exe --xml=set1-gmres.xml

Expected Behavior/Output

You should see output like this:

========================================================
Xpetra::Parameters
 Linear algebra library: Tpetra
Galeri::Xpetra::Parameters<int>
 Matrix type: Laplace2D
 Problem size: 10000 (100x100)
Processor subdomains in x direction: 1
Processor subdomains in y direction: 1
Processor subdomains in z direction: -1
========================================================
Galeri complete.
========================================================

  *******************************************************
  ***** Belos Iterative Solver:  Pseudo Block Gmres
  ***** Maximum Iterations: 100
  ***** Block Size: 1
  ***** Residual Test:
  *****   Test 1 : Belos::StatusTestImpResNorm<>: (2-Norm Res Vec) / (2-Norm Prec Res0), tol = 1e-08
  *******************************************************
  Iter   0, [ 1] :    1.000000e+00
  Iter  10, [ 1] :    2.934358e-03
  Iter  20, [ 1] :    4.605673e-04
  Iter  30, [ 1] :    1.647752e-04
  Iter  40, [ 1] :    7.453272e-05
  Iter  50, [ 1] :    4.822835e-05
  Iter  60, [ 1] :    2.768790e-05
  Iter  70, [ 1] :    1.712668e-05
  Iter  80, [ 1] :    1.194621e-05
  Iter  90, [ 1] :    9.782121e-06
  Iter 100, [ 1] :    7.221948e-06
  Passed.......OR Combination ->
    Failed.......Number of Iterations = 100 == 100
    Unconverged..(2-Norm Res Vec) / (2-Norm Prec Res0)
                 residual [ 0 ] = 7.22195e-06 > 1e-08

Examining Results

We observe the following:

We constructed a 2D Laplace problem on a square mesh of 100 x 100 nodes, resulting in a linear system of 10000 unknowns.
We have set two termination criteria for the solve: 100 iterations or a reduction of the residual norm by 8 orders of magnitude.
The solve failed, since we reached 100 iterations, but only reduced the residual norm by a factor of 7e-06.

Now, modify the input file to use the conjugate gradient method. We change the Solver Type parameter to Pseudo Block CG.

Run

./MueLu_Stratimikos.exe --xml=set1-cg.xml

Q: Do you see any significant changes in convergence behavior?

No, neither solver manages to converge in less than 100 iterations.

Q: Why is it preferable to use conjugate gradients over GMRES in this case?

CG has significantly lower memory requirements, since it uses a short recurrence. GMRES, on the other hand, has to keep the entire Krylov space around.

You can check the last answer by comparing the approximate memory usage of CG and GMRES using

/usr/bin/time -v ./MueLu_Stratimikos.exe --nx=1000 --ny=1000 --xml=set1-gmres.xml 2>&1 | grep "Maximum resident set size"
/usr/bin/time -v ./MueLu_Stratimikos.exe --nx=1000 --ny=1000 --xml=set1-cg.xml    2>&1 | grep "Maximum resident set size"

We used a larger problem to be able to see the difference more clearly.

In what follows, we will be using the CG solver.

Set 2 - Krylov solver, simple preconditioners

We now explore some simple (and quite generic) options for preconditioning the problem.

So far, the Preconditioner Type parameter was set to None, meaning no preconditioning. We now use Ifpack2 instead. Ifpack2 is another Trilinos package which provides a number of different simple preconditioners.

Moreover, we add the following configuration for Ifpack2.

<ParameterList name="Ifpack2">
  <Parameter name="Prec Type" type="string" value="relaxation"/>
  <ParameterList name="Ifpack2 Settings">
    <Parameter name="relaxation: type" type="string" value="Symmetric Gauss-Seidel"/>
    <Parameter name="relaxation: sweeps" type="int" value="1"/>
  </ParameterList>
</ParameterList>

This means that a single sweep of symmetric Gauss-Seidel is used for preconditioning.

Run

./MueLu_Stratimikos.exe --xml=set2-sgs1.xml

We can strengthen the preconditioner by increasing the number of symmetric Gauss-Seidel sweeps we are using as a preconditioner. We switch relaxation: sweeps to 3.

Run

./MueLu_Stratimikos.exe --xml=set2-sgs3.xml

and verify that the number of iterations further decreased.

Now, we will check whether we have created a scalable solver strategy.

Record the number of iterations for different problem sizes by running

./MueLu_Stratimikos.exe --xml=set2-sgs3.xml --nx=50  --ny=50
./MueLu_Stratimikos.exe --xml=set2-sgs3.xml --nx=100 --ny=100
./MueLu_Stratimikos.exe --xml=set2-sgs3.xml --nx=200 --ny=200

(This means that we are running the same 2D Laplace problem as above, but on meshes of size 50x50, etc.)

Q: Is the solver scalable?

No, the number of iterations increases with the problem size.

The number of iterations taken by CG scales with the square root of the condition number \(\kappa(PA)\) of the preconditioned system, where \(P\) is the preconditioner.

Q: Based on the iterations you recorded, how does this condition number roughly scale with respect to the number of unknowns?

In each step, the number of iterations grows by a factor of 2, and the number of unknowns grows by a factor of 4. Hence the condition number is proportional to the number of unknowns.

Set 3 - Krylov solver, multigrid preconditioner

The reason that the Gauss-Seidel preconditioner did not work well is that it effectively only reduces error locally, but not globally. We hence need a global mechanism of error correction, which can be provided by adding one or more coarser grids.

We switch the Preconditioner Type parameter to MueLu, which is an algebraic multigrid package in Trilinos.

Run

./MueLu_Stratimikos.exe --xml=set3-mg-jacobi.xml --nx=50  --ny=50
./MueLu_Stratimikos.exe --xml=set3-mg-jacobi.xml --nx=100 --ny=100
./MueLu_Stratimikos.exe --xml=set3-mg-jacobi.xml --nx=200 --ny=200

Q: Is the solver scalable?

Yes, the number of iterations stays more or less constant as we increase the problem size.

Understanding information about the multigrid preconditioner

Let’’s look a little more closely at the output from the largest example.

Rerun:

./MueLu_Stratimikos.exe --xml=set3-mg-jacobi.xml --nx=200 --ny=200

The multigrid summary provides the following information:

The number of multigrid levels created, including the linear system of interest
Operator complexity
Smoother complexity
The multigrid cycle type
Matrix statistics for each level (rows, number of nonzeros, number of processors)
The smoother used on each level

The operator complexity is given by the formula \[\frac{\Sigma_0^L nnz(A_i)}{nnz(A_0)},\]

where \(A_0\) is the fine level matrix.

Q: How does the operator complexity help the user assess potential performance?

The complexity gives the ratio of nonzeros in all matrices compared to the fine level matrix. This is roughly the ratio of FLOPs required by a matrix-vector product performed on each matrix relative to the fine level matrix.

The smoother complexity is the ratio of FLOPS required by smoothers on all levels to FLOPs required by just the fine level smoother.

Q: Why might the operator complexity and smoother complexity differ?

A smoother such as an incomplete factorization will have a much higher FLOP intensity than a matrix-vector product.

Effect of different smoothers

The first adjustment that we want to make is to select different smoothers. This involves the following trade-off: choosing a more effective smoother should reduce the number of iterations, but might involve more computation.

By default, we use a single sweep of Jacobi smoothing, which is very cheap.

First, we run

./MueLu_Stratimikos.exe --xml=set3-mg-jacobi.xml --timings --nx=1000 --ny=1000

to display timing information on a large enough problem. The relevant timer to look at is Belos: PseudoBlockCGSolMgr total solve time. (You might want to run this more than once in case you are experiencing some system noise.) Since there are quite a lot of timers, you could grep for the iteration count and the timer by appending | egrep "total solve time|Number of Iterations" to the command, i.e.,

./MueLu_Stratimikos.exe --xml=set3-mg-jacobi.xml --timings --nx=1000 --ny=1000 |  egrep "total solve time|Number of Iterations"

We know that Gauss-Seidel is a better smoother than Jacobi. There are two ways of using Gauss-Seidel while keeping the preconditioner symmetric: we can either use different directions in the sweeps in pre- and post-smoothing, or use a symmetric Gauss-Seidel smoother for both.

Run

./MueLu_Stratimikos.exe --xml=set3-mg-sgs.xml --timings --nx=1000 --ny=1000
./MueLu_Stratimikos.exe --xml=set3-mg-gs.xml  --timings --nx=1000 --ny=1000

and compare the timings with the Jacobi case.

Q: Do you see an improvement?

Yes. For symmetric Gauss-Seidel, the number of iterations decreases. For forward Gauss-Seidel for pre-smoothing and backwards Gauss-Seidel for post-smoothing, both number of iterations and time-to-solution are reduced.

Now let’s see the effect of running Gauss-Seidel with increasing numbers of MPI ranks.

Run

mpiexec -np 1  ./MueLu_Stratimikos.exe --xml=set3-mg-gs.xml --timings --nx=1000 --ny=1000 |  egrep "total solve time|Number of Iterations"
mpiexec -np 12 ./MueLu_Stratimikos.exe --xml=set3-mg-gs.xml --timings --nx=1000 --ny=1000 |  egrep "total solve time|Number of Iterations"

Q: What do you observe as you add MPI ranks?

The number of iterations changes slightly, while the solution time decreases.

Q: Do you think that Gauss-Seidel is easily adaptible for use on massively parallel architectures such as GPUs?

Gauss-Seidel has limited opportunities for parallelism. Equation \(i\) cannot be solved until all equations \(j, j<i\) that \(i\) depends on have been solved.

Hint: Have a look at the Gauss-Seidel algorithm.

Another common smoother is a matrix polynomial, specifically, a Chebyshev polynomial. This type smoother has certain advantages over relaxation methods like Jacobi or Gauss-Seidel.

Chebyshev will have better convergence properties than Jacobi.
The Chebyshev computational kernel is a sparse matrix-vector multiply (SpMV), which is invariant with respect to the number of processes.
The SpMV kernel is naturally parallelizable with many high-performance implementations. There are limited opportunities for parallelism in Gauss-Seidel, e.g., coloring.

We switch the smother to Chebyshev. Repeat the above experiment.

mpiexec -np 1 ./MueLu_Stratimikos.exe  --xml=set3-mg-chebyshev.xml --timings --nx=1000 --ny=1000 |  egrep "total solve time|Number of Iterations"
mpiexec -np 12 ./MueLu_Stratimikos.exe --xml=set3-mg-chebyshev.xml --timings --nx=1000 --ny=1000 |  egrep "total solve time|Number of Iterations"

Choosing a smoother that is computationally inexpensive but with poor convergence properties can result in a large number of solver iterations. Choosing a smoother that is computationally expensive but with good convergence properties can result in a small number of solver iterations, but overall long run times.

Out-Brief

In this lesson, we have developed a scalable solver for a simple test problem, the Poisson equation.

A good choice of solver and preconditioner will depend significantly on the problem that needs to be solved, and are often the topic of active research.

CG works for symmetric, GMRES for unsymmetric systems, but GMRES has a larger memory footprint. (Trilinos has many more specialized Krylov solvers. The Belos Doxygen is a good starting point, but some newer communication reducing algorithms have not yet been fully documented.)
One-level preconditioners (such as Jacobi and Gauss-Seidel) will often not lead to a scalable solver.
Multigrid solvers are scalable (on Poisson), but getting good performance often involves some parameter tuning.

MueLu on next-generation platforms

MueLu has specialized kernels that allow it to run on next-generation computing platforms such as KNLs and GPUs, using a Kokkos backend. If MueLu has been compiled with OpenMP or CUDA support, this code can be enabled at runtime by setting the parameter use kokkos refactor to true.

Add the openmpi-2.1.5 and cuda-10.0 modules to your environment.

soft add +openmpi-2.1.5
soft add +cuda-10.0
export CUDA_LAUNCH_BLOCKING=1
export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1

Try running

./MueLu_Stratimikos_gpu.exe --xml=mg-gpu.xml --nx=1000 --ny=1000 --timings | egrep "total solve time|Number of Iterations"

with the refactor option set.

If you want to use both GPUs, run

mpiexec -n 2 ./MueLu_Stratimikos_gpu.exe --xml=mg-gpu.xml --nx=1000 --ny=1000 --kokkos-num-devices=2

Running your own problem

The executable has the option to load the linear system and the right-hand side from MatrixMarket files, e.g.,

./MueLu_Stratimikos.exe --matrix=poisson-matrix.m --rhs=poisson-rhs.m --coords=poisson-coords.m