Linear Solvers
Krylov Solvers and Preconditioning with MueLu/Trilinos
At a Glance
Questions  Objectives  Key Points 
How do we choose a suitable Krylov solver?  Know when to use CG or GMRES.  CG works for spd matrix and preconditioner. GMRES works for unsymmetric systems, but requires more memory. 
How do we choose a preconditioner?  Know common sparse preconditioners.  As the size of the linear system grows, most iterative methods will require increasing number of iterations. 
How can we improve efficiency of the solver?  Understand the basic components of multigrid.  For certain common problem classes, multigrid methods require a constant number of iterations and constant work per unknown. 
Table of contents
 Set 1  Krylov solver, no preconditioner
 Set 2  Krylov solver, simple preconditioners
 Set 3  Krylov solver, multigrid preconditioner
To begin this lesson
Make sure that you have followed the setup instructions.
Go to the directory for the Krylov application
cd track5numerical/krylov_amg_muelu
The Problem Being Solved
The Poisson equation arises in electrostatics, incompressible fluid flow simulations, and numerous other applications. We will consider the Poisson equation \[\Delta u = f\]
on a square mesh of size \(n_x \times n_y\) with Dirichlet boundary conditions \(u = 0\).
It is discretized using central finite differences, leading to a symmetric positive definite (spd) matrix.
The Example Source Code
For this lesson, we will be using the executable MueLu_Stratimikos.exe
from the MueLu package of Trilinos which allows us to test a variety of solvers and preconditioners.
For the first part of the lesson, we will be running on a single MPI rank, so no need to ask for a massive allocation.
The executable takes several command line arguments that influence the linear system that is generated on the fly or read in from file. A complete set of options will be printed by typing
./MueLu_Stratimikos.exe help
The most important ones are:
Usage: ./MueLu_Stratimikos.exe [options]
options:
help Prints this help message
nx int mesh points in xdirection.
(default: nx=100)
ny int mesh points in ydirection.
(default: ny=100)
nz int mesh points in zdirection.
(default: nz=100)
matrixType string matrix type: Laplace1D, Laplace2D, Laplace3D, ...
(default: matrixType="Laplace2D")
xml string read parameters from an xml file
(default: xml="stratimikos_ParameterList.xml")
yaml string read parameters from a yaml file
(default: yaml="")
Solvers (such as CG and GMRES) and preconditioners (such as Jacobi, GaussSeidel and multigrid) are configured via parameter files.
Trilinos supports both XML and YAML parameter files.
In order to make following along as simple as possible and avoid typos, parameter XML files for all examples can be found in the lesson folder.
If you want to see the difference between to input files, run diff a.xml b.xml
.
Running the Example
Set 1  Krylov solver, no preconditioner
The default Krylov method is GMRES, and no preconditioner is used.
Run
./MueLu_Stratimikos.exe xml=set1gmres.xml
Expected Behavior/Output
You should see output like this:
========================================================
Xpetra::Parameters
Linear algebra library: Tpetra
Galeri::Xpetra::Parameters<int>
Matrix type: Laplace2D
Problem size: 10000 (100x100)
Processor subdomains in x direction: 1
Processor subdomains in y direction: 1
Processor subdomains in z direction: 1
========================================================
Galeri complete.
========================================================
*******************************************************
***** Belos Iterative Solver: Pseudo Block Gmres
***** Maximum Iterations: 100
***** Block Size: 1
***** Residual Test:
***** Test 1 : Belos::StatusTestImpResNorm<>: (2Norm Res Vec) / (2Norm Prec Res0), tol = 1e08
*******************************************************
Iter 0, [ 1] : 1.000000e+00
Iter 10, [ 1] : 2.934358e03
Iter 20, [ 1] : 4.605673e04
Iter 30, [ 1] : 1.647752e04
Iter 40, [ 1] : 7.453272e05
Iter 50, [ 1] : 4.822835e05
Iter 60, [ 1] : 2.768790e05
Iter 70, [ 1] : 1.712668e05
Iter 80, [ 1] : 1.194621e05
Iter 90, [ 1] : 9.782121e06
Iter 100, [ 1] : 7.221948e06
Passed.......OR Combination >
Failed.......Number of Iterations = 100 == 100
Unconverged..(2Norm Res Vec) / (2Norm Prec Res0)
residual [ 0 ] = 7.22195e06 > 1e08
Examining Results
We observe the following:
 We constructed a 2D Laplace problem on a square mesh of 100 x 100 nodes, resulting in a linear system of 10000 unknowns.
 We have set two termination criteria for the solve: 100 iterations or a reduction of the residual norm by 8 orders of magnitude.
 The solve failed, since we reached 100 iterations, but only reduced the residual norm by a factor of 7e06.
Now, modify the input file to use the conjugate gradient method.
We change the Solver Type
parameter to Pseudo Block CG
.
Run
./MueLu_Stratimikos.exe xml=set1cg.xml
No, neither solver manages to converge in less than 100 iterations.
CG has significantly lower memory requirements, since it uses a short recurrence. GMRES, on the other hand, has to keep the entire Krylov space around.
You can check the last answer by comparing the approximate memory usage of CG and GMRES using
/usr/bin/time v ./MueLu_Stratimikos.exe nx=1000 ny=1000 xml=set1gmres.xml 2>&1  grep "Maximum resident set size"
/usr/bin/time v ./MueLu_Stratimikos.exe nx=1000 ny=1000 xml=set1cg.xml 2>&1  grep "Maximum resident set size"
We used a larger problem to be able to see the difference more clearly.
In what follows, we will be using the CG solver.
Set 2  Krylov solver, simple preconditioners
We now explore some simple (and quite generic) options for preconditioning the problem.
So far, the Preconditioner Type
parameter was set to None
, meaning no preconditioning.
We now use Ifpack2
instead.
Ifpack2 is another Trilinos package which provides a number of different simple preconditioners.
Moreover, we add the following configuration for Ifpack2.
<ParameterList name="Ifpack2">
<Parameter name="Prec Type" type="string" value="relaxation"/>
<ParameterList name="Ifpack2 Settings">
<Parameter name="relaxation: type" type="string" value="Symmetric GaussSeidel"/>
<Parameter name="relaxation: sweeps" type="int" value="1"/>
</ParameterList>
</ParameterList>
This means that a single sweep of symmetric GaussSeidel is used for preconditioning.
Run
./MueLu_Stratimikos.exe xml=set2sgs1.xml
We can strengthen the preconditioner by increasing the number of symmetric GaussSeidel sweeps we are using as a preconditioner.
We switch relaxation: sweeps
to 3.
Run
./MueLu_Stratimikos.exe xml=set2sgs3.xml
and verify that the number of iterations further decreased.
Now, we will check whether we have created a scalable solver strategy.
Record the number of iterations for different problem sizes by running
./MueLu_Stratimikos.exe xml=set2sgs3.xml nx=50 ny=50
./MueLu_Stratimikos.exe xml=set2sgs3.xml nx=100 ny=100
./MueLu_Stratimikos.exe xml=set2sgs3.xml nx=200 ny=200
(This means that we are running the same 2D Laplace problem as above, but on meshes of size 50x50, etc.)
No, the number of iterations increases with the problem size.
The number of iterations taken by CG scales with the square root of the condition number \(\kappa(PA)\) of the preconditioned system, where \(P\) is the preconditioner.
In each step, the number of iterations grows by a factor of 2, and the number of unknowns grows by a factor of 4. Hence the condition number is proportional to the number of unknowns.
Stop. If you are following the lecture slides, please return to the slides before continuing to set 3.
Set 3  Krylov solver, multigrid preconditioner
The reason that the GaussSeidel preconditioner did not work well is that it effectively only reduces error locally, but not globally. We hence need a global mechanism of error correction, which can be provided by adding one or more coarser grids.
We switch the Preconditioner Type
parameter to MueLu
, which is an algebraic multigrid package in Trilinos.
Run
./MueLu_Stratimikos.exe xml=set3mgjacobi.xml nx=50 ny=50
./MueLu_Stratimikos.exe xml=set3mgjacobi.xml nx=100 ny=100
./MueLu_Stratimikos.exe xml=set3mgjacobi.xml nx=200 ny=200
Yes, the number of iterations stays more or less constant as we increase the problem size.
Understanding information about the multigrid preconditioner
Let’’s look a little more closely at the output from the largest example.
Rerun:
./MueLu_Stratimikos.exe xml=set3mgjacobi.xml nx=200 ny=200
The multigrid summary provides the following information:
 The number of multigrid levels created, including the linear system of interest
 Operator complexity
 Smoother complexity
 The multigrid cycle type
 Matrix statistics for each level (rows, number of nonzeros, number of processors)
 The smoother used on each level
The operator complexity is given by the formula \[\frac{\Sigma_0^L nnz(A_i)}{nnz(A_0)},\]
where \(A_0\) is the fine level matrix.
The complexity gives the ratio of nonzeros in all matrices compared to the fine level matrix. This is roughly the ratio of FLOPs required by a matrixvector product performed on each matrix relative to the fine level matrix.
The smoother complexity is the ratio of FLOPS required by smoothers on all levels to FLOPs required by just the fine level smoother.
A smoother such as an incomplete factorization will have a much higher FLOP intensity than a matrixvector product.
Effect of different smoothers
The first adjustment that we want to make is to select different smoothers. This involves the following tradeoff: choosing a more effective smoother should reduce the number of iterations, but might involve more computation.
By default, we use a single sweep of Jacobi smoothing, which is very cheap.
First, we run
./MueLu_Stratimikos.exe xml=set3mgjacobi.xml timings nx=1000 ny=1000
to display timing information on a large enough problem.
The relevant timer to look at is Belos: PseudoBlockCGSolMgr total solve time
.
(You might want to run this more than once in case you are experiencing some system noise.)
Since there are quite a lot of timers, you could grep for the iteration count and the timer by appending
 egrep "total solve timeNumber of Iterations"
to the command, i.e.,
./MueLu_Stratimikos.exe xml=set3mgjacobi.xml timings nx=1000 ny=1000  egrep "total solve timeNumber of Iterations"
We know that GaussSeidel is a better smoother than Jacobi. There are two ways of using GaussSeidel while keeping the preconditioner symmetric: we can either use different directions in the sweeps in pre and postsmoothing, or use a symmetric GaussSeidel smoother for both.
Run
./MueLu_Stratimikos.exe xml=set3mgsgs.xml timings nx=1000 ny=1000
./MueLu_Stratimikos.exe xml=set3mggs.xml timings nx=1000 ny=1000
and compare the timings with the Jacobi case.
Yes. For symmetric GaussSeidel, the number of iterations decreases. For forward GaussSeidel for presmoothing and backwards GaussSeidel for postsmoothing, both number of iterations and timetosolution are reduced.
Now let’s see the effect of running GaussSeidel with increasing numbers of MPI ranks.
Run
mpiexec np 1 ./MueLu_Stratimikos.exe xml=set3mggs.xml timings nx=1000 ny=1000  egrep "total solve timeNumber of Iterations"
mpiexec np 12 ./MueLu_Stratimikos.exe xml=set3mggs.xml timings nx=1000 ny=1000  egrep "total solve timeNumber of Iterations"
The number of iterations changes slightly, while the solution time decreases.
GaussSeidel has limited opportunities for parallelism. Equation \(i\) cannot be solved until all equations \(j, j<i\) that \(i\) depends on have been solved.
Hint: Have a look at the GaussSeidel algorithm.
Another common smoother is a matrix polynomial, specifically, a Chebyshev polynomial. This type smoother has certain advantages over relaxation methods like Jacobi or GaussSeidel.
 Chebyshev will have better convergence properties than Jacobi.
 The Chebyshev computational kernel is a sparse matrixvector multiply (SpMV), which is invariant with respect to the number of processes.
 The SpMV kernel is naturally parallelizable with many highperformance implementations. There are limited opportunities for parallelism in GaussSeidel, e.g., coloring.
We switch the smother to Chebyshev. Repeat the above experiment.
mpiexec np 1 ./MueLu_Stratimikos.exe xml=set3mgchebyshev.xml timings nx=1000 ny=1000  egrep "total solve timeNumber of Iterations"
mpiexec np 12 ./MueLu_Stratimikos.exe xml=set3mgchebyshev.xml timings nx=1000 ny=1000  egrep "total solve timeNumber of Iterations"
Choosing a smoother that is computationally inexpensive but with poor convergence properties can result in a large number of solver iterations. Choosing a smoother that is computationally expensive but with good convergence properties can result in a small number of solver iterations, but overall long run times.
OutBrief
In this lesson, we have developed a scalable solver for a simple test problem, the Poisson equation.
A good choice of solver and preconditioner will depend significantly on the problem that needs to be solved, and are often the topic of active research.

CG works for symmetric, GMRES for unsymmetric systems, but GMRES has a larger memory footprint. (Trilinos has many more specialized Krylov solvers. The Belos Doxygen is a good starting point, but some newer communication reducing algorithms have not yet been fully documented.)

Onelevel preconditioners (such as Jacobi and GaussSeidel) will often not lead to a scalable solver.

Multigrid solvers are scalable (on Poisson), but getting good performance often involves some parameter tuning.
MueLu on nextgeneration platforms
MueLu has specialized kernels that allow it to run on nextgeneration computing platforms such as KNLs and GPUs,
using a Kokkos backend.
If MueLu has been compiled with OpenMP or CUDA support, this code can be enabled at runtime by setting the parameter use kokkos refactor
to true.
export CUDA_LAUNCH_BLOCKING=1
export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1
export TPETRA_ASSUME_CUDA_AWARE_MPI=0
Try running
./MueLu_Stratimikos_gpu.exe xml=mggpu.xml nx=1000 ny=1000 timings node=cuda config
with the refactor option set.
If you want to use both GPUs, run
mpiexec n 2 ./MueLu_Stratimikos_gpu.exe xml=mggpu.xml nx=1000 ny=1000 kokkosnumdevices=2 node=cuda config
Running your own problem
The executable has the option to load the linear system and the righthand side from MatrixMarket files, e.g.,
./MueLu_Stratimikos.exe matrix=poissonmatrix.m rhs=poissonrhs.m coords=poissoncoords.m
Further Reading
Trilinos GitHub Repo Please feel free to submit questions, feature requests and bug reports to the issue tracker.