Linear Solvers
Sparse, Direct Solvers with SuperLU
At A Glance
Questions | Objectives | Key Points |
---|---|---|
1. Why use a direct solver? | Understand accuracy | Direct solvers are robust for difficult problems. |
2. What affects direct solve performance? | Understand ordering options | Time and space performance can vary a lot. |
To begin this lesson
- Get into the correct directory
$ cd track-5-numerical/superlu_mfem
The problem being solved
The convdiff.c application models the steady-state convection-diffusion equation in 2D with a constant velocity. This equation is used to model the concentration of something like a dye in a moving fluid as it diffuses and flows through the fluid. The equation is as follows: \[\nabla \cdot (\kappa \nabla u) - \nabla \cdot (\overrightarrow{v}u)+R=0\]
where \(u\) is the concentration we are tracking, \(\kappa\) is the diffusion rate, \(\overrightarrow{v}\) is the velocity of the flow, and \(R\) is a concentration source.
In the application used here, the velocity direction is fixed in the +x direction, but its magnitude is set by the user (default 100); \(\kappa\) is fixed at 1.0; and the source is 0.0 everywhere except for a small square centered at the middle of the domain, where it is 1.0.
Figure: Initial condition
Solving this PDE is well known to cause convergence problems for iterative solvers as \(v\) grows larger. We use MFEM as a vehicle to demonstrate the use of a distributed-memory direct solver, SuperLU_DIST, to solve very ill-conditioned linear systems.
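The discretization behind convdiff.c is built with MFEM. As a rough illustration, the operator might be assembled as below; this is a minimal sketch using standard MFEM integrators, not the exact code in convdiff.c:

```cpp
// Minimal sketch (not the actual convdiff.c source): assembling the
// convection-diffusion operator with standard MFEM integrators.
#include "mfem.hpp"
using namespace mfem;

void AssembleConvDiff(FiniteElementSpace &fespace, double kappa, double vel)
{
   ConstantCoefficient kappa_coeff(kappa);   // diffusion rate (1.0 in this lesson)
   Vector v(2); v = 0.0; v(0) = vel;         // velocity fixed in the +x direction
   VectorConstantCoefficient v_coeff(v);

   // Weak form of -div(kappa grad u) + v . grad u = R:
   BilinearForm a(&fespace);
   a.AddDomainIntegrator(new DiffusionIntegrator(kappa_coeff)); // (kappa grad u, grad w)
   a.AddDomainIntegrator(new ConvectionIntegrator(v_coeff));    // (v . grad u, w)
   a.Assemble();
   a.Finalize();
}
```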
Running the Example
Run 1: default settings with the GMRES solver, preconditioned by hypre, velocity = 100
$ ./convdiff | tail -n 3
Time required for first solve: 0.0342715 (s)
Final L2 norm of residual: 2.43841e-16
Figure: Steady state
Run 2: increase velocity to 1000; GMRES no longer converges
$ ./convdiff --velocity 1000 | tail -n 3
Time required for first solve: 0.434585 (s)
Final L2 norm of residual: 0.00095
Below, we plot the behavior of the GMRES method for velocity values in the range [100,1000] in increments of dv = 25, and we also show an animation of the solution GMRES produces as the velocity increases.
Figures: Solutions @ dv=25 in [100,1000] | Contours of solution @ vel=1000
Figures: Time to solution | L2 norm of final residual
The GMRES method works well for low velocity values. As the velocity increases, it eventually crosses a threshold beyond which it can no longer provide a useful result.
As that instability is approached, more and more GMRES iterations are required to reach the desired norm. GMRES is still able to manage the solve and achieve a near-zero L2 norm; it just takes more and more iterations. Once GMRES is unable to solve, the L2 norm explodes.
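Runs 1 and 2 exercise an iterative path like the following. A minimal sketch of GMRES preconditioned with hypre's BoomerAMG in MFEM (the tolerance and iteration limit are illustrative, not the values convdiff.c uses):

```cpp
// Sketch: GMRES preconditioned by hypre's BoomerAMG, as in Runs 1-2.
// A, B, X are the assembled parallel matrix, right-hand side, and solution.
#include "mfem.hpp"
using namespace mfem;

void SolveWithGMRES(HypreParMatrix &A, HypreParVector &B, HypreParVector &X)
{
   HypreBoomerAMG amg(A);        // algebraic multigrid preconditioner
   HypreGMRES gmres(A);
   gmres.SetTol(1e-12);          // illustrative tolerance
   gmres.SetMaxIter(2000);       // illustrative iteration limit
   gmres.SetPrintLevel(0);
   gmres.SetPreconditioner(amg);
   gmres.Mult(B, X);             // converges slowly (or not at all) for large velocity
}
```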
Run 3: Now use SuperLU_DIST, with “natural ordering”
$ ./convdiff --velocity 1000 -slu -cp 0
Options used:
--refine 0
--order 1
--velocity 1000
--no-visit
--superlu
--slu-colperm 0
--slu-rowperm 1
--one-matrix
--one-rhs
Number of unknowns: 10201
Nonzeros in L 1040781
Nonzeros in U 1045632
nonzeros in L+U 2076212
nonzeros in LSUB 1040215
** Memory Usage **********************************
** NUMfact space (MB): (sum-of-all-processes)
L\U : 41.12 | Total : 50.74
** Total highmark (MB):
Sum-of-all : 61.17 | Avg : 61.17 | Max : 61.17
**************************************************
**** Time (seconds) ****
EQUIL time 0.00
ROWPERM time 0.01
SYMBFACT time 0.04
DISTRIBUTE time 0.11
FACTOR time 20.68
Factor flops 1.956262e+08 Mflops 9.46
SOLVE time 0.11
Solve flops 5.167045e+06 Mflops 45.99
REFINEMENT time 0.23 Steps 2
**************************************************
Time required for first solve: 21.1867 (s)
Final L2 norm of residual: 1.82335e-18
Figure: Steady state for vel=1000
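The -cp values map directly onto SuperLU_DIST's column-permutation enum (the mapping below matches the --slu-colperm numbers printed in the output; type names follow recent SuperLU_DIST releases):

```cpp
// Sketch: selecting the column ordering with the SuperLU_DIST C API.
// convdiff.c drives this through MFEM's SuperLU wrapper; the raw API
// is shown here for clarity.
#include "superlu_ddefs.h"

void ConfigureOrdering(superlu_dist_options_t &options, int cp)
{
   set_default_options_dist(&options);
   switch (cp)
   {
      case 0: options.ColPerm = NATURAL;         break; // -cp 0: no reordering
      case 2: options.ColPerm = MMD_AT_PLUS_A;   break; // -cp 2: minimum degree on A'+A
      case 4: options.ColPerm = METIS_AT_PLUS_A; break; // -cp 4: METIS nested dissection on A'+A
   }
}
```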
Run 4: Now use SuperLU_DIST, with MMD(A’+A) ordering.
$ ./convdiff --velocity 1000 -slu -cp 2
Options used:
--refine 0
--order 1
--velocity 1000
--no-visit
--superlu
--slu-colperm 2
--slu-rowperm 1
--one-matrix
--one-rhs
Number of unknowns: 10201
Nonzeros in L 606806
Nonzeros in U 605547
nonzeros in L+U 1202152
** Memory Usage **********************************
** NUMfact space (MB): (sum-of-all-processes)
L\U : 10.39 | Total : 18.65
** Total highmark (MB):
Sum-of-all : 22.25 | Avg : 22.25 | Max : 22.25
**************************************************
**** Time (seconds) ****
EQUIL time 0.00
ROWPERM time 0.01
COLPERM time 0.04
SYMBFACT time 0.01
DISTRIBUTE time 0.02
FACTOR time 0.05
Factor flops 1.063303e+08 Mflops 2045.75
SOLVE time 0.00
Solve flops 2.367059e+06 Mflops 779.35
REFINEMENT time 0.01 Steps 2
**************************************************
Time required for first solve: 0.114001 (s)
Final L2 norm of residual: 1.77054e-18
NOTE: the number of nonzeros in L+U is much smaller than with natural ordering. This reduces both memory usage and runtime.
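Working out the comparison between Runs 3 and 4 from the logs above: \[\frac{2076212}{1202152}\approx 1.7\times \text{ fewer nonzeros in } L+U, \qquad \frac{20.68\ \text{s}}{0.05\ \text{s}}\approx 400\times \text{ faster factorization.}\]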
Run 5: Now use SuperLU_DIST, with Metis(A’+A) ordering.
$ ./convdiff --velocity 1000 -slu -cp 4
Options used:
--refine 0
--order 1
--velocity 1000
--no-visit
--superlu
--slu-colperm 4
--slu-rowperm 1
--one-matrix
--one-rhs
Number of unknowns: 10201
Nonzeros in L 518644
Nonzeros in U 521021
nonzeros in L+U 1029464
** Memory Usage **********************************
** NUMfact space (MB): (sum-of-all-processes)
L\U : 9.17 | Total : 17.04
** Total highmark (MB):
Sum-of-all : 21.04 | Avg : 21.04 | Max : 21.04
**************************************************
**** Time (seconds) ****
EQUIL time 0.00
ROWPERM time 0.01
COLPERM time 0.05
SYMBFACT time 0.00
DISTRIBUTE time 0.02
FACTOR time 0.05
Factor flops 5.749964e+07 Mflops 1212.24
SOLVE time 0.00
Solve flops 2.100190e+06 Mflops 637.17
REFINEMENT time 0.01 Steps 2
**************************************************
Time required for first solve: 0.146954 (s)
Final L2 norm of residual: 1.91956e-18
Figures: Solutions @ dv=25 in [100,1000] | Steady state solution @ vel=1000
Run 5.5: Now use SuperLU_DIST, with Metis(A’+A) ordering, using 1 MPI task, on a larger problem.
By adding --refine 3, each element in the mesh is refined three times; each 2D refinement splits every element into four, yielding a 64x larger problem. But we’ll run it on only one processor.
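In MFEM, this kind of uniform refinement is conventionally a short loop. A sketch, assuming convdiff.c follows the usual MFEM example pattern:

```cpp
// Sketch: how --refine levels of uniform refinement look in MFEM.
// Each UniformRefinement() splits every 2D element into 4, so
// ref_levels = 3 multiplies the element count by 4^3 = 64.
#include "mfem.hpp"

void Refine(mfem::Mesh &mesh, int ref_levels)
{
   for (int l = 0; l < ref_levels; l++)
   {
      mesh.UniformRefinement();
   }
}
```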
$ /soft/libraries/mpi/mvapich2-2.2/gcc/bin/mpiexec -n 1 ./convdiff --refine 3 --velocity 1000 -slu -cp 4
Options used:
--refine 3
--order 1
--velocity 1000
--no-visit
--superlu
--slu-colperm 4
--slu-rowperm 1
--slu-parsymbfact 0
--one-matrix
--one-rhs
Number of unknowns: 641601
Nonzeros in L 40059096
Nonzeros in U 40059096
nonzeros in L+U 79476591
** Memory Usage **********************************
** NUMfact space (MB): (sum-of-all-processes)
L\U : 696.35 | Total : 744.90
** Total highmark (MB):
Sum-of-all : 1447.47 | Avg : 1447.47 | Max : 1447.47
**************************************************
**** Time (seconds) ****
EQUIL time 0.03
ROWPERM time 0.29
COLPERM time 4.59
SYMBFACT time 0.31
DISTRIBUTE time 1.46
FACTOR time 10.14
Factor flops 1.720048e+10 Mflops 1695.74
SOLVE time 0.34
Solve flops 1.588700e+08 Mflops 474.11
REFINEMENT time 0.81 Steps 2
**************************************************
Time required for first solve: 18.4967 (s)
Final L2 norm of residual: 5.99574e-18
Run 6: Now use SuperLU_DIST, with Metis(A’+A) ordering, using 12 MPI tasks, on a larger problem.
Here, we’ll re-run the above, except on 12 tasks, saving the output with tee so we can grep it for some key values of interest.
$ /soft/libraries/mpi/mvapich2-2.2/gcc/bin/mpiexec -n 12 ./convdiff --refine 3 --velocity 1000 -slu --slu-colperm 4 | tee run6.out
Options used:
--refine 3
--order 1
--velocity 1000
--no-visit
--superlu
--slu-colperm 4
--slu-rowperm 1
--slu-parsymbfact 0
--one-matrix
--one-rhs
Number of unknowns: 641601
Nonzeros in L 40340620
Nonzeros in U 40340620
nonzeros in L+U 80039639
** Memory Usage **********************************
** NUMfact space (MB): (sum-of-all-processes)
L\U : 705.31 | Total : 974.93
** Total highmark (MB):
Sum-of-all : 2888.58 | Avg : 180.54 | Max : 180.54
**************************************************
**** Time (seconds) ****
EQUIL time 0.03
ROWPERM time 0.36
COLPERM time 5.57
SYMBFACT time 0.37
DISTRIBUTE time 0.30
FACTOR time 1.62
Factor flops 2.301228e+10 Mflops 14226.55
SOLVE time 0.14
Solve flops 1.623936e+08 Mflops 1148.60
REFINEMENT time 0.30 Steps 2
**************************************************
Time required for first solve: 9.14984 (s)
Final L2 norm of residual: 3.49602e-21
We have increased the mesh resolution by 8x in each direction here. The number of unknowns grows as the square of the 1D resolution, which accounts for the 64x factor in DOFs: the 101x101 grid of unknowns (10,201) becomes 801x801 (641,601). We have also used 12x more processors. The parallel runtime is 9.15 seconds, versus 18.5 seconds on one task: the FACTOR time drops from 10.14 to 1.62 seconds, but the ordering (COLPERM) time does not improve, which limits the overall speedup.
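For context, SuperLU_DIST lays the MPI tasks out as a 2D process grid. A sketch of the underlying setup for 12 tasks (the 3x4 shape is an illustrative choice; MFEM's wrapper normally picks one for you):

```cpp
// Sketch: creating a SuperLU_DIST 2D process grid for 12 MPI tasks.
// nprow * npcol must equal the number of tasks participating in the
// factorization; 3x4 is just one reasonable shape for 12.
#include "superlu_ddefs.h"

void MakeGrid(gridinfo_t &grid)
{
   int nprow = 3, npcol = 4;
   superlu_gridinit(MPI_COMM_WORLD, nprow, npcol, &grid);
   // ... factor and solve ...
   superlu_gridexit(&grid);
}
```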
Run 7: Now use SuperLU_DIST, solve the systems with same A, but different right-hand side b.
Here, we solve a different linear system but with the same coefficient matrix A. We tell SuperLU to re-use the existing LU factors and only supply a different right-hand side. Notice the improvement in solve time when re-using the factors (a sketch of how this looks in the underlying API follows the run output).
$ /soft/libraries/mpi/mvapich2-2.2/gcc/bin/mpiexec -n 12 ./convdiff --refine 3 --velocity 1000 -slu -cp 4 -2rhs
Options used:
--refine 3
--order 1
--velocity 1000
--no-visit
--superlu
--slu-colperm 4
--slu-rowperm 1
--slu-parsymbfact 0
--one-matrix
--two-rhs
Number of unknowns: 641601
Nonzeros in L 40340620
Nonzeros in U 40340620
nonzeros in L+U 80039639
nonzeros in LSUB 15901421
** Memory Usage **********************************
** NUMfact space (MB): (sum-of-all-processes)
L\U : 705.31 | Total : 974.93
** Total highmark (MB):
Sum-of-all : 2888.58 | Avg : 180.54 | Max : 180.54
**************************************************
Time required for first solve: 9.11672 (s)
Final L2 norm of residual: 2.14235e-39
**************************************************
**** Time (seconds) ****
EQUIL time 0.04
ROWPERM time 0.36
COLPERM time 5.64
SYMBFACT time 0.38
DISTRIBUTE time 0.23
FACTOR time 1.61
Factor flops 2.301228e+10 Mflops 14307.11
SOLVE time 0.14
Solve flops 1.623936e+08 Mflops 1147.81
REFINEMENT time 0.30 Steps 2
**************************************************
Time required for second solve (new rhs): 0.46439 (s)
Final L2 norm of residual: 1.95236e-39
SOLVE time 0.14
Solve flops 1.623936e+08 Mflops 1202.77
REFINEMENT time 0.29 Steps 2
**************************************************
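Under the hood, this factor reuse corresponds to calling SuperLU_DIST's driver a second time with the Fact option set to FACTORED. A sketch with the raw C API (struct type names follow recent releases and vary slightly across versions):

```cpp
// Sketch: reusing LU factors for a second right-hand side with SuperLU_DIST.
// b1, b2 are local RHS blocks; pdgssvx overwrites them with the solution.
#include "superlu_ddefs.h"

void SolveTwice(superlu_dist_options_t &options, SuperMatrix &A,
                dScalePermstruct_t &perm, dLUstruct_t &lu,
                dSOLVEstruct_t &slv, gridinfo_t &grid,
                double *b1, double *b2, int ldb, double *berr)
{
   SuperLUStat_t stat; int info;
   PStatInit(&stat);

   options.Fact = DOFACT;     // first call: factor A, then solve A x = b1
   pdgssvx(&options, &A, &perm, b1, ldb, 1, &grid, &lu, &slv, berr, &stat, &info);

   options.Fact = FACTORED;   // second call: reuse L and U; triangular solves only
   pdgssvx(&options, &A, &perm, b2, ldb, 1, &grid, &lu, &slv, berr, &stat, &info);

   PStatFree(&stat);
}
```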
Out-Brief
In this lesson, we have used MFEM as a vehicle to demonstrate the value of direct solvers from the SuperLU_DIST numerical package.
Further Reading
To learn more about sparse direct solvers, see the Gene Golub SIAM Summer School course materials: Lecture Notes, Book Chapter, and Video.