Nonlinear Solvers

Nonlinear Solvers with PETSc

Introduction to Nonlinear Solvers: Newton-Krylov Methods and Nonlinear Preconditioning

At a Glance

1. What are the tradeoffs between exactness and inexactness in Newton methods?
   Objective: Observe trade-offs between inner and outer iterations as the linear solver tolerance is varied.
   Key point: An inexact linear solve may be less robust, but can result in a significantly faster nonlinear solve.
2. What makes a scalable inexact Newton method?
   Objective: Observe execution time and convergence behavior as the mesh spacing is decreased.
   Key point: Newton methods exhibit mesh-independent convergence, but need to be combined with scalable linear solvers.
3. How and when do nonlinear solvers fail?
   Objective: Explore the limits of nonlinear solvers by systematically increasing problem nonlinearity.
   Key point: Nonlinear solvers can fail in a variety of ways, and some workarounds exist.
4. Can we improve the robustness of Newton’s method by combining it with other solvers?
   Objective: Explore nonlinear preconditioning for highly nonlinear problems.
   Key point: Nonlinear analogs of ideas from iterative linear solvers can significantly improve nonlinear solvers.

Note: The executable for this lesson should already be provided, but if it needs to be rebuilt, do

cd track-5-numerical/nonlinear_solvers_petsc
make ex19

Introduction

Systems of nonlinear equations \[F(x) = b \quad \mathrm{where} \quad F : \mathbb{R}^N \to \mathbb{R}^N\]

arise in countless settings in computational science. Unlike their linear counterparts, direct methods for general nonlinear systems do not exist. Iterative methods are required!

In this lesson, we will do some hands-on exploration, solving a model nonlinear problem using the nonlinear solvers from the PETSc library. We will focus on variants on Newton’s method, exploring exact vs. inexact Newton methods, Newton-Krylov, and Newton-Krylov-multigrid methods. We will end with some exploration of a topic that is relatively unexplored both theoretically and experimentally: nonlinear preconditioning.
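To fix notation for what follows, write the system as \(F(x) = 0\) (absorbing \(b\) into \(F\), as in our example problem). Each outer Newton iteration solves a linear system with the Jacobian and applies a (possibly damped) update; an inexact Newton method only solves that linear system approximately, to a relative tolerance \(\eta_k\) often called the forcing term:

\[\begin{align*} J(x_k)\, \delta x_k &\approx -F(x_k), \quad \text{accepted once } \|F(x_k) + J(x_k)\, \delta x_k\| \le \eta_k \|F(x_k)\|, \\ x_{k+1} &= x_k + \lambda_k\, \delta x_k, \end{align*}\]

where \(J\) is the Jacobian of \(F\) and \(\lambda_k\) is a step length chosen by a line search. In the runs below, \(\eta_k\) corresponds (roughly) to the linear solver's relative tolerance, -ksp_rtol.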

The Problem We Are Solving: the driven cavity CFD benchmark

(Figure: driven cavity steady-state solution)

We will use the nonlinear solvers provided by the PETSc Scalable Nonlinear Equation Solvers (SNES) component to solve the steady-state nonisothermal driven cavity problem as implemented in SNES example ex19. This is a classic CFD benchmark that simulates a fluid-filled 2D box with a lid that moves at constant tangential velocity (probably employing a conveyor belt). Flow is driven by both the lid motion and buoyancy effects. Our example uses a velocity-vorticity formulation, in which the governing equations can be expressed as \[\begin{align*} - \Delta U - \partial_y \Omega &= 0 \\ - \Delta V + \partial_x \Omega &= 0 \\ - \Delta \Omega + \nabla \cdot ([U \Omega, V \Omega]) - \mathrm{Gr}\ \partial_x T &= 0 \\ - \Delta T + \mathrm{Pr}\ \nabla \cdot ([U T, V T]) &= 0 \end{align*}\]

where \(U\) and \(V\) are velocities, \(T\) is temperature, \(\Omega\) is vorticity, \(\mathrm{Gr}\) is the Grashof number and \(\mathrm{Pr}\) is the Prandtl number.

Example 1: Initial exploration and understanding PETSc options

Let’s begin by running ex19 on a single MPI rank with some basic command-line options.

mpiexec -n 1 ./ex19 -snes_monitor -snes_converged_reason -da_grid_x 16 -da_grid_y 16 -da_refine 2 -lidvelocity 100 -grashof 1e2

You should see output similar to

lid velocity = 100., prandtl # = 1., grashof # = 100.
  0 SNES Function norm 7.681163231938e+02 
  1 SNES Function norm 6.582880149343e+02 
  2 SNES Function norm 5.294044874550e+02 
  3 SNES Function norm 3.775102116141e+02 
  4 SNES Function norm 3.047226778615e+02 
  5 SNES Function norm 2.599983722908e+00 
  6 SNES Function norm 9.427314747057e-03 
  7 SNES Function norm 5.212213461756e-08 
Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 7
Number of SNES iterations = 7

The table below explains the options we have used:

Option Flag               Effect
-snes_monitor             Show the progress of the SNES solver
-snes_converged_reason    Print the reason for SNES convergence or divergence
-da_grid_x 16             Set the number of initial grid points in the x direction to 16
-da_grid_y 16             Set the number of initial grid points in the y direction to 16
-da_refine 2              Refine the initial grid 2 times before creation
-lidvelocity 100          Set the dimensionless velocity of the lid to 100
-grashof 1e2              Set the Grashof number to 1e2

An element of the PETSc design philosophy is extensive runtime customizability. We can always use -help to enumerate and explain the various command-line options available to a PETSc executable. (Note that the help will be tailored according to the set of options being specified.)
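For example, one quick way to browse the (very long) help output for options related to the nonlinear solver type is to filter it with grep; the exact text printed varies between PETSc versions:

mpiexec -n 1 ./ex19 -help | grep -i snes_type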

To see the details of how our SNES solver is configured, we can add -snes_view, which will print information at the end of the run about the SNES object and the underlying KSP (linear solvers) and PC (preconditioners) objects.

Example -snes_view output

Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 7
SNES Object: 1 MPI processes
  type: newtonls
  maximum iterations=50, maximum function evaluations=10000
  tolerances: relative=1e-08, absolute=1e-50, solution=1e-08
  total number of linear solver iterations=835
  total number of function evaluations=11
  norm schedule ALWAYS
  Jacobian is built using colored finite differences on a DMDA
  SNESLineSearch Object: 1 MPI processes
    type: bt
      interpolation: cubic
      alpha=1.000000e-04
    maxstep=1.000000e+08, minlambda=1.000000e-12
    tolerances: relative=1.000000e-08, absolute=1.000000e-15, lambda=1.000000e-08
    maximum iterations=40
  KSP Object: 1 MPI processes
    type: gmres
      restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
      happy breakdown tolerance 1e-30
    maximum iterations=10000, initial guess is zero
    tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
    left preconditioning
    using PRECONDITIONED norm type for convergence test
  PC Object: 1 MPI processes
    type: ilu
      out-of-place factorization
      0 levels of fill
      tolerance for zero pivot 2.22045e-14
      matrix ordering: natural
      factor fill ratio given 1., needed 1.
        Factored matrix follows:
          Mat Object: 1 MPI processes
            type: seqaij
            rows=14884, cols=14884, bs=4
            package used to perform factorization: petsc
            total: nonzeros=293776, allocated nonzeros=293776
            total number of mallocs used during MatSetValues calls=0
              using I-node routines: found 3721 nodes, limit used is 5
    linear system matrix = precond matrix:
    Mat Object: 1 MPI processes
      type: seqaij
      rows=14884, cols=14884, bs=4
      total: nonzeros=293776, allocated nonzeros=293776
      total number of mallocs used during MatSetValues calls=0
        using I-node routines: found 3721 nodes, limit used is 5

Managing PETSc Options

PETSc offers a very large number of runtime options. All can be set via command line, but can also be set from input files and shell environment variables.
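For instance, the same options could be placed in a plain text file (one option per line) and loaded with -options_file. The file name cavity.opts below is just an illustration:

cat > cavity.opts <<'EOF'
-snes_monitor
-snes_converged_reason
-lidvelocity 100
EOF
mpiexec -n 1 ./ex19 -options_file cavity.opts -da_grid_x 16 -da_grid_y 16 -da_refine 2 -grashof 1e2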

To facilitate readability, we’ll put the command-line arguments common to the remaining hands-on exercises in the PETSC_OPTIONS environment variable.

export PETSC_OPTIONS="-snes_monitor -snes_converged_reason -lidvelocity 100 -da_grid_x 16 -da_grid_y 16 -ksp_converged_reason -log_view :log.txt"

We’ve added -ksp_converged_reason to see how and when the linear solver halts.

We’ve also added -log_view to write the PETSc performance logging info to a file, log.txt. Such performance logs are full of useful information for understanding the performance of a PETSc code, but we don’t have time to explore them in this session. (For those interested, see Chapter 13, “Profiling”, of the PETSc User Manual.) Here, we will simply use the performance logs to find the overall wall-clock time, which we can quickly locate with the grep utility:

grep Time\ \(sec\): log.txt

(The first number returned is the total run time in seconds.)

IMPORTANT: When you finish these hands-on lessons, be sure to clear your PETSC_OPTIONS environment variable (do “unset PETSC_OPTIONS”) if you will be doing hands-on exercises in another session. Otherwise, you may get unexpected behavior if running executables that are linked against PETSc.

Example 2: Exact vs. Inexact Newton

The output from running -snes_view in the previous exercise shows us that PETSc defaults to an inexact Newton method. To run with exact Newton (and to check the execution time), we use -pc_type lu, which indicates to the KSP object (which controls the linear solver) that the underlying preconditioner should be a full LU decomposition:

mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1e2 -pc_type lu
grep Time\ \(sec\): log.txt

(Remember, the above command assumes that you have set PETSC_OPTIONS as specified in the preceding section.)

Sample output on my laptop (Dell XPS 13 with Intel Core i7-10710U CPU)

rmills@encke:~/proj/petsc/src/snes/tutorials (master=)$ mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1e2 -pc_type lu
lid velocity = 100., prandtl # = 1., grashof # = 100.
  0 SNES Function norm 7.681163231938e+02 
  Linear solve converged due to CONVERGED_RTOL iterations 1
  1 SNES Function norm 6.581690463182e+02 
  Linear solve converged due to CONVERGED_RTOL iterations 1
  2 SNES Function norm 5.291809327801e+02 
  Linear solve converged due to CONVERGED_RTOL iterations 1
  3 SNES Function norm 3.772079270664e+02 
  Linear solve converged due to CONVERGED_RTOL iterations 1
  4 SNES Function norm 3.040001036822e+02 
  Linear solve converged due to CONVERGED_RTOL iterations 1
  5 SNES Function norm 2.518761157519e+00 
  Linear solve converged due to CONVERGED_RTOL iterations 1
  6 SNES Function norm 9.230363584762e-03 
  Linear solve converged due to CONVERGED_RTOL iterations 1
  7 SNES Function norm 6.155757710494e-09 
Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 7
Number of SNES iterations = 7
rmills@encke:~/proj/petsc/src/snes/tutorials (master=)$ grep Time\ \(sec\): log.txt
Time (sec):           6.728e-01     1.000   6.728e-01

The work required to solve the inner linear system so precisely is likely wasted. Let’s try using the default iterative solver with some different tolerances. Start with

mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1e2 -ksp_rtol 1e-8

and then try some larger values for relative convergence tolerance, -ksp_rtol. Try -ksp_rtol 1e-5 (the PETSc default) next, and try increasing it by an order of magnitude each time.

What happens to the SNES iteration count? When does the SNES solve diverge?

The SNES iteration count is 7 initially, then increases to 8 at a relative tolerance of 1e-3 for the linear solver, and then 10 at a tolerance of 1e-2. The SNES solve diverges when run with a linear solver tolerance of 1e-1.

What yields the shortest execution time?

On my laptop, the loosest tolerance I tried, -ksp_rtol 1e-2, turns out to be fastest, solving the problem in 0.43 seconds. Though it requires more Newton iterations, it performs far fewer linear solver iterations. Running with the default tolerance of 1e-5 requires 0.60 seconds. A tolerance of 1e-8 is far too tight, resulting in many linear solver iterations and an execution time of over 1.2 seconds. Using LU factorization turns out not to be too bad for this small problem (around 0.68 seconds), but it is difficult for LU to scale up to large problems.
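As an aside (not part of the original exercise), PETSc can also choose the forcing term adaptively instead of using a fixed -ksp_rtol, via the Eisenstat-Walker scheme enabled with -snes_ksp_ew; it may be interesting to compare its timing against the hand-tuned tolerances above:

mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1e2 -snes_ksp_ew
grep Time\ \(sec\): log.txt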

Example 3: Scaling up the grid size and running in parallel

Let’s explore what happens as we scale up the grid size for our model problem.

For this exercise, we will run in parallel because the experiments may otherwise take too long. We will use a fixed number of MPI ranks, even though this number is really too large for the smaller grids, to eliminate effects due to varying the size of the subdomains used by the default parallel preconditioner (block Jacobi with ILU(0) applied on each block). We also use BiCGStab (-ksp_type bcgs) instead of the default linear solver, GMRES(30), because GMRES(30) fails for some of these cases.

Using the linear solver defaults, increase the size of the grid (that is, decrease the grid spacing) and observe what happens to iteration counts and execution times:

mpiexec -n 8 ./ex19 -ksp_type bcgs -grashof 1e2 -da_refine 2
mpiexec -n 8 ./ex19 -ksp_type bcgs -grashof 1e2 -da_refine 3
mpiexec -n 8 ./ex19 -ksp_type bcgs -grashof 1e2 -da_refine 4 

Sample output for -da_refine 4 case

$ mpiexec -n 8 ./ex19 -ksp_type bcgs -grashof 1e2 -da_refine 4
lid velocity = 100., prandtl # = 1., grashof # = 100.
  0 SNES Function norm 1.545962539057e+03 
  Linear solve converged due to CONVERGED_RTOL iterations 125
  1 SNES Function norm 9.782056791818e+02 
  Linear solve converged due to CONVERGED_RTOL iterations 128
  2 SNES Function norm 6.621936395441e+02 
  Linear solve converged due to CONVERGED_RTOL iterations 389
  3 SNES Function norm 3.155672307430e+00 
  Linear solve converged due to CONVERGED_RTOL iterations 470
  4 SNES Function norm 8.129969608884e-03 
  Linear solve converged due to CONVERGED_RTOL iterations 425
  5 SNES Function norm 8.852310456001e-08 
Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 5

You should observe that the Newton solver shows “mesh independence” – that is, as we refine the grid spacing, we see roughly the same convergence behavior for the Newton iterates. This is not true, however, for the linear solver, which shows unsustainable growth in the number of iterations it requires.

What happens if we employ a Newton-Krylov-multigrid method? Add -pc_type mg to use a geometric multigrid preconditioner (defaults to a V-cycle, but check out the -help output to see how to use other types; you may also want to try -snes_view to see the multigrid hierarchy):

mpiexec -n 8 ./ex19 -ksp_type bcgs -grashof 1e2 -pc_type mg -da_refine 2
mpiexec -n 8 ./ex19 -ksp_type bcgs -grashof 1e2 -pc_type mg -da_refine 3
mpiexec -n 8 ./ex19 -ksp_type bcgs -grashof 1e2 -pc_type mg -da_refine 4 

Sample output for -pc_type mg -da_refine 4 case

mpiexec -n 8 ./ex19 -ksp_type bcgs -grashof 1e2 -pc_type mg -da_refine 4
lid velocity = 100., prandtl # = 1., grashof # = 100.
  0 SNES Function norm 1.545962539057e+03 
  Linear solve converged due to CONVERGED_RTOL iterations 6
  1 SNES Function norm 9.778196290981e+02 
  Linear solve converged due to CONVERGED_RTOL iterations 6
  2 SNES Function norm 6.609659458090e+02 
  Linear solve converged due to CONVERGED_RTOL iterations 7
  3 SNES Function norm 2.791922927549e+00 
  Linear solve converged due to CONVERGED_RTOL iterations 6
  4 SNES Function norm 4.973591997243e-03 
  Linear solve converged due to CONVERGED_RTOL iterations 6
  5 SNES Function norm 3.241555827567e-05 
  Linear solve converged due to CONVERGED_RTOL iterations 9
  6 SNES Function norm 9.883136583477e-10 
Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 6

The takeaway here is that the combination of the fast convergence of a globalized Newton method and a multigrid preconditioner for the inner, linear solve can be a powerful and highly scalable solver.
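If you would like to poke at the multigrid preconditioner itself, PETSc exposes its pieces as runtime options as well. The combination below (number of levels plus a Richardson/SOR smoother on each level) is only a sketch of knobs to experiment with, not a recommended configuration, and we have not verified that it is any faster:

mpiexec -n 8 ./ex19 -ksp_type bcgs -grashof 1e2 -da_refine 4 -pc_type mg -pc_mg_levels 3 -mg_levels_ksp_type richardson -mg_levels_pc_type sor -snes_view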

Example 4: Increasing the strength of the nonlinearity

Let’s explore what happens as we increase the strength of the nonlinearity by raising the Grashof number. Try running

mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1e2
mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1e3
mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1e4
mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1.3e4

Sample output for mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1.3e4

mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1.3e4
lid velocity = 100., prandtl # = 1., grashof # = 13000.
  0 SNES Function norm 7.971152173639e+02 
  Linear solve did not converge due to DIVERGED_ITS iterations 10000
Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE iterations 0

Oops! At a Grashof number of 1.3e4, we get a failure in the linear solver. Let’s see if a stronger preconditioner can help us:

mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1.3e4 -pc_type mg

Sample output for mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1.3e4 -pc_type mg

mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1.3e4 -pc_type mg
lid velocity = 100., prandtl # = 1., grashof # = 13000.
  ...
  4 SNES Function norm 3.209967262833e+02 
  Linear solve converged due to CONVERGED_RTOL iterations 9
  5 SNES Function norm 2.121900163587e+02 
  Linear solve converged due to CONVERGED_RTOL iterations 9
  6 SNES Function norm 1.139162432910e+01 
  Linear solve converged due to CONVERGED_RTOL iterations 8
  7 SNES Function norm 4.048269317796e-01 
  Linear solve converged due to CONVERGED_RTOL iterations 8
  8 SNES Function norm 3.264993685206e-04 
  Linear solve converged due to CONVERGED_RTOL iterations 8
  9 SNES Function norm 1.154893029612e-08 
Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 9

Success! But what if we increase the Grashof number a little more? Try

mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1.3373e4 -pc_type mg

Sample output for mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1.3373e4 -pc_type mg

lid velocity = 100., prandtl # = 1., grashof # = 13373.
...
 48 SNES Function norm 3.124919801005e+02 
  Linear solve converged due to CONVERGED_RTOL iterations 17
 49 SNES Function norm 3.124919800338e+02 
  Linear solve converged due to CONVERGED_RTOL iterations 17
 50 SNES Function norm 3.124919799645e+02 
Nonlinear solve did not converge due to DIVERGED_MAX_IT iterations 50

No good! Let’s try brute force and employ -pc_type lu:

mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1.3373e4 -pc_type lu

Sample output for mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1.3373e4 -pc_type lu

mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1.3373e4 -pc_type lu
...
 48 SNES Function norm 3.193724239842e+02 
  Linear solve converged due to CONVERGED_RTOL iterations 1
 49 SNES Function norm 3.193724232621e+02 
  Linear solve converged due to CONVERGED_RTOL iterations 1
 50 SNES Function norm 3.193724181714e+02 
Nonlinear solve did not converge due to DIVERGED_MAX_IT iterations 50

We eventually reach a point that seems to be beyond the capabilities of our Newton solver. What now?

Example 5: Nonlinear Richardson Preconditioned with Newton

Since our Newton solver is unable to make progress on its own, let’s try combining it with another nonlinear solver. We will try nonlinear Richardson iteration, preconditioned with Newton’s method.
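Before running it, here is a rough sketch of what this composition means, glossing over line-search details. Plain nonlinear Richardson updates along the (negative) residual; with a nonlinear preconditioner \(M\) (here, a few Newton iterations started from the current iterate), the update direction is replaced by the displacement the preconditioner produces:

\[ x_{k+1} = x_k + \lambda_k \left( M(x_k) - x_k \right) \qquad \text{instead of} \qquad x_{k+1} = x_k - \lambda_k F(x_k). \]

In PETSc, the outer solver is selected with -snes_type nrichardson, and options for the inner (preconditioning) Newton solver carry the npc_ prefix: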

mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1.3373e4 -snes_type nrichardson -npc_snes_type newtonls -npc_snes_max_it 4 -npc_pc_type mg

Sample output

lid velocity = 100., prandtl # = 1., grashof # = 13373.
Nonlinear solve did not converge due to DIVERGED_INNER iterations 0
Number of SNES iterations = 0
The run with -npc_pc_type mg fails immediately (the inner Newton solve diverges), so let’s instead use LU factorization for the linear solves inside the inner Newton solver:

mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1.3373e4 -snes_type nrichardson -npc_snes_type newtonls -npc_snes_max_it 4 -npc_pc_type lu

Sample output

lid velocity = 100., prandtl # = 1., grashof # = 13373.
  0 SNES Function norm 7.987708558131e+02 
  1 SNES Function norm 8.467169687854e+02 
  2 SNES Function norm 7.300096001529e+02 
  3 SNES Function norm 5.587232361127e+02 
  4 SNES Function norm 3.071143076019e+03 
  5 SNES Function norm 3.347748537471e+02 
  6 SNES Function norm 1.383297972324e+01 
  7 SNES Function norm 1.209841384629e-02 
  8 SNES Function norm 8.660606193428e-09 
Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 8

So nonlinear Richardson preconditioned with Newton has managed to get us further than Newton alone. Let’s try increasing the Grashof number a little more:

mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1.4e4 -snes_type nrichardson -npc_snes_type newtonls -npc_snes_max_it 4 -npc_pc_type lu

Sample output

lid velocity = 100., prandtl # = 1., grashof # = 14000.
...
 37 SNES Function norm 5.992348444448e+02 
 38 SNES Function norm 5.992348444290e+02 
Nonlinear solve did not converge due to DIVERGED_INNER iterations 38

We’ve hit another barrier. What about switching things up? Let’s try preconditioning Newton with nonlinear Richardson.

Example 6: Newton Preconditioned with Nonlinear Richardson

mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1.4e4 -pc_type mg -npc_snes_type nrichardson -npc_snes_max_it 1 -snes_max_it 1000

Sample output

...
352 SNES Function norm 2.145588832260e-02 
  Linear solve converged due to CONVERGED_RTOL iterations 7
353 SNES Function norm 1.288292314235e-05 
  Linear solve converged due to CONVERGED_RTOL iterations 8
354 SNES Function norm 3.219155715396e-10 
Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 354

Well, this did eventually work, but it’s pretty slow. Why don’t we try upping the number of iterations of the inner, nonlinear Richardson solver?

mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1.4e4 -pc_type mg -npc_snes_type nrichardson -npc_snes_max_it 3

Sample output

...
 23 SNES Function norm 4.796734188970e+00 
  Linear solve converged due to CONVERGED_RTOL iterations 7
 24 SNES Function norm 2.083806106198e-01 
  Linear solve converged due to CONVERGED_RTOL iterations 8
 25 SNES Function norm 1.368771861149e-04 
  Linear solve converged due to CONVERGED_RTOL iterations 8
 26 SNES Function norm 1.065794992653e-08 
Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 26

Much improved! Can we do even better?

mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1.4e4 -pc_type mg -npc_snes_type nrichardson -npc_snes_max_it 4
mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1.4e4 -pc_type mg -npc_snes_type nrichardson -npc_snes_max_it 5
mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1.4e4 -pc_type mg -npc_snes_type nrichardson -npc_snes_max_it 6
mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1.4e4 -pc_type mg -npc_snes_type nrichardson -npc_snes_max_it 7

Sample output

lid velocity = 100., prandtl # = 1., grashof # = 14000.
  0 SNES Function norm 8.016512665033e+02 
  Linear solve converged due to CONVERGED_RTOL iterations 11
  1 SNES Function norm 7.961475922316e+03 
  Linear solve converged due to CONVERGED_RTOL iterations 10
  2 SNES Function norm 3.238304139699e+03 
  Linear solve converged due to CONVERGED_RTOL iterations 10
  3 SNES Function norm 4.425107973263e+02 
  Linear solve converged due to CONVERGED_RTOL iterations 9
  4 SNES Function norm 2.010474128858e+02 
  Linear solve converged due to CONVERGED_RTOL iterations 8
  5 SNES Function norm 2.936958163548e+01 
  Linear solve converged due to CONVERGED_RTOL iterations 8
  6 SNES Function norm 1.183847022611e+00 
  Linear solve converged due to CONVERGED_RTOL iterations 8
  7 SNES Function norm 6.662829301594e-03 
  Linear solve converged due to CONVERGED_RTOL iterations 7
  8 SNES Function norm 6.170083332176e-07 
Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 8

Newton preconditioned with nonlinear Richardson can be pushed quite far! Try

mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1e6 -pc_type lu -npc_snes_type nrichardson -npc_snes_max_it 7 -snes_max_it 1000

Sample output

lid velocity = 100., prandtl # = 1., grashof # = 1e+06
...
 69 SNES Function norm 4.241700887134e+00 
  Linear solve converged due to CONVERGED_RTOL iterations 1
 70 SNES Function norm 3.238739735055e+00 
  Linear solve converged due to CONVERGED_RTOL iterations 1
 71 SNES Function norm 1.781881532852e+00 
  Linear solve converged due to CONVERGED_RTOL iterations 1
 72 SNES Function norm 1.677710773493e-05 
Nonlinear solve converged due to CONVERGED_FNORM_RELATIVE iterations 72

Wow! The solver converges with a Grashof number of one million. If you want to explore further, see how much further you can push this!
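The take-away messages below also mention NGMRES as a useful outer solver in such compositions. If you want one more configuration to experiment with, a (hypothetical, untuned) variant that preconditions nonlinear GMRES with Newton would look like the following; we make no promise about how far it gets:

mpiexec -n 1 ./ex19 -da_refine 2 -grashof 1e6 -snes_type ngmres -npc_snes_type newtonls -npc_pc_type lu -snes_max_it 1000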

Take-Away Messages

PETSc provides a wide assortment of nonlinear solvers through the SNES component. With it, users can build sophisticated solvers from composable algorithmic components:

  • Inner, linear solves can employ the full range of solvers and preconditioners provided by PETSc KSP and PC
    • Multigrid solvers particularly important for mesh size-independent convergence
  • Composite nonlinear solvers can be built analogously, using building blocks from PETSc SNES

Newton-Krylov methods are predominant, but there is a large design space of “composed” nonlinear solvers to explore:

  • Not well-explored theoretically or experimentally (interesting research opportunities!)
  • Composed nonlinear solvers can be very powerful, though frustratingly fragile
    • As a rule of thumb, nonlinear Richardson, Gauss-Seidel, or NGMRES with Newton often improves robustness

Further items to explore include

  • Nonlinear domain decomposition (SNESASPIN/SNESNASM) and nonlinear multigrid (or Full Approximation Scheme, SNESFAS) methods
  • PETSc timesteppers use SNES to solve nonlinear problems at each time step
    • Pseudo-transient continuation (TSPSEUDO) can solve highly nonlinear steady-state problems

Extra Credit: Running Nonlinear Solvers Using PETSc’s GPU Back-Ends

The concepts we have covered so far are mostly orthogonal to the topic of how to run PETSc solvers on GPUs. Since computing on GPUs has become so important, however, we suggest a few exercises here for students who want a brief look at GPU support in PETSc and how its nonlinear solvers can be executed on GPUs.

Before we begin, let’s clear the PETSC_OPTIONS environment variable we were using:

unset PETSC_OPTIONS

We will run a large version (-da_refine 9) of the driven cavity problem, using multigrid with GPU- and SIMD-friendly Chebyshev-Jacobi smoothing (-mg_levels_pc_type jacobi), and collect a breakdown by multigrid level (-pc_mg_log), logging performance in a text file.

We get the best CPU-only performance on ThetaGPU using 8 MPI ranks:

mpiexec -n 8 ./ex19 -da_refine 9 -pc_type mg -mg_levels_pc_type jacobi -pc_mg_log -log_view :log_mg_cpu_n8.txt

Running on the GPU, we get the best performance using only one rank (we could probably use more if running NVIDIA MPS, but this is not enabled on ThetaGPU):

mpiexec -n 1 ./ex19 -da_refine 9 -pc_type mg -mg_levels_pc_type jacobi -pc_mg_log -dm_vec_type cuda -dm_mat_type aijcusparse -log_view_gpu_time -log_view :log_mg_gpu_n1.txt

(The -log_view_gpu_time option is actually not needed in the 3.17 release of PETSc, but in future releases it will be required to get full timings for all events on the GPU. This change is being added because of the overhead associated with collecting these timings.)

Opening up the .txt versions of the -log_view files will show us a lot of data about the performance. There is a useful overall summary at the top, and then timings for many “events”, which occur during different “stages” of the computation. When PETSc has been built with GPU support, several additional columns are present, showing copies between CPU and GPU memory and the percentage of flops computed on the GPU. The events that are logged are not exclusive: the SNESSolve time is that required to solve our entire problem, but the time for other events, such as KSPSolve (invoked in solving the Jacobian system inside the Newton solve), is included in the SNESSolve time.
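For a quick comparison of a few key events without reading the whole files, one can simply grep for them; the event names below (SNESSolve, KSPSolve, PCApply, and the per-level MGSmooth entries) appear in standard PETSc logs, though the exact columns depend on the build:

grep -E "SNESSolve|KSPSolve|PCApply|MGSmooth" log_mg_cpu_n8.txt
grep -E "SNESSolve|KSPSolve|PCApply|MGSmooth" log_mg_gpu_n1.txt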

It can be difficult to tell from the text output how the logging events are nested. So let’s repeat the same runs, but this time generate stack trace files that can be used to generate flame graphs.

On 8 MPI ranks for CPU-only:

mpiexec -n 8 ./ex19 -da_refine 9 -pc_type mg -mg_levels_pc_type jacobi -pc_mg_log -log_view :log_mg_cpu_n8.stack:ascii_flamegraph

And on 1 MPI rank for the GPU case:

mpiexec -n 1 ./ex19 -da_refine 9 -pc_type mg -mg_levels_pc_type jacobi -pc_mg_log -dm_vec_type cuda -dm_mat_type aijcusparse -log_view_gpu_time -log_view :log_mg_gpu_n1.stack:ascii_flamegraph

Download the .stack files to your local machine, and then use SpeedScope.app to generate interactive flame graphs of the performance data, which will let you examine the hierarchy of PETSc events and their relative costs.

One thing to note is the relative distribution of time in the MGSmooth events (part of PCApply). These are the applications of the smoother on each multigrid level; MGSmooth Level 0 corresponds to the smoother application on the coarsest level of the multigrid hierarchy. See how the smoother applications on the coarse levels all take roughly the same time on the GPU? This points to the high kernel launch latency. What other noteworthy differences can you find?

Additional Things To Try

Look at sources of speedup (or slowdown!) for the GPU vs. CPU

The total time in SNESSolve tells us the time required to solve our entire problem. Compare these for the CPU and GPU cases to get the overall speedup, but what parts sped up in the GPU case? Which parts actually slowed down?

(The slowdowns are mostly due to the fact that the nonlinear function and Jacobian routines in SNES ex19 do not run on the GPU—verify this by looking at the GPU %F column in the text version of the log—and we had to use fewer ranks in this case. See SNES tutorial ex55 in the main development branch of PETSc for an example where these run on the GPU.)
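One way to check this is to grep the logs for the function and Jacobian evaluation events and compare their times and GPU %F values between the two runs (event names as they appear in standard PETSc logs):

grep -E "SNESFunctionEval|SNESJacobianEval" log_mg_cpu_n8.txt
grep -E "SNESFunctionEval|SNESJacobianEval" log_mg_gpu_n1.txt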

Running with a different GPU back-end

If you’d like to try another GPU back-end, you can try PETSc’s Kokkos/Kokkos Kernels one. Run with -dm_mat_type aijkokkos -dm_vec_type kokkos:

mpirun -n 1 ./ex19 -da_refine 9 -pc_type mg -mg_levels_pc_type jacobi -pc_mg_log -dm_vec_type kokkos -dm_mat_type aijkokkos -log_view_gpu_time -log_view :log_mg_kokkos_n1.txt

Experimenting with different multigrid cycle types

Our SNES ex19 runs defaulted to using multigrid V-cycles. Try running with W-cycles instead by using the option -pc_mg_cycle_type w.

Unlike V-cycles, W-cycles visit coarse levels many more times than fine ones. What does this do to the time spent in multigrid smoothers for the GPU case vs. the CPU-only one? Should one or the other of these be favored when using GPUs?

An Important Reminder About Cleaning Up Your PETSc Options

IMPORTANT: If you will be doing hands-on lessons from other ATPESC sessions, remember to clear your PETSC_OPTIONS environment variable:

unset PETSC_OPTIONS

Otherwise, you may get unexpected behavior from executables that link against PETSc.

Further Reading

Previous Nonlinear Solvers Lectures
