Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Newly Released Capabilities in the Distributed-Memory SuperLU Sparse Direct Solver

Newly Released Capabilities in the Distributed-Memory SuperLU Sparse Direct Solver Newly Released Capabilities in Distributed-memory SuperLU Sparse Direct Solver XIAOYE S. LI, PAUL LIN, and YANG LIU, Lawrence Berkeley National Laboratory, USA PIYUSH SAO, Oak Ridge National Laboratory, USA We present the new features available in the recent releaseSuperLU_DIST of , Version 8.1.1.SuperLU_DIST is a distributed- memory parallel sparse direct solver. The new features include (1) a 3D communication-avoiding algorithm framework which trades of inter-process communication for selective memory duplication, (2) multi-GPU support for both NVIDIA GPUs and AMD GPUs, and (3) mixed precision routines that perform single precision LU factorization and double precision iterative reinement. Apart from the algorithm improvements, we also modernized the software build system to use CMake and Spack package installation tools to simplify the installation procedure. Throughout the paper, we describe in detail the pertinent performance-sensitive parameters associated with each new algorithmic feature, show how they are exposed to the users, and give general guidance of how to set these parameters. We illustrate that the solvers performance both in time and memory can be greatly improved after systematic tuning of the parameters, depending on the input sparse matrix and underlying hardware. CCS Concepts: · Mathematics of computing→ Solvers. Additional Key Words and Phrases: Sparse direct solver, communication-avoiding, GPU, mixed-precision 1 OVERVIEW OF SUPERLU_DIST 1,2 SuperLU_DIST is a distributed-memory parallel sparse direct solver library for solving large sets of linear equations�� = � [6]. Here � is a square, nonsingular�,× � sparse matrix, and� and � are dense � × ��ℎ� matrices, where��ℎ�is the number of right-hand sides and solution vectors. The matrix � need not be symmetric or deinite; indeed,SuperLU_DIST is particularly appropriate for unsymmetric matrices, and it respects both the unsymmetric values as well as the unsymmetric sparsity pattern. The library uses variations of Gaussian elimination (LU factorization) optimized to take advantage of the sparsity of the matrix and modern high performance computer architectures (speciically memory hierarchy and parallelism). It is implemented in ANSI C, using MPI for communication, OpenMP for multithreading, and CUDA (or HIP) for NVIDIA (or AMD) GPUs. The library includes routines to handle both real and complex matrices in single and double precisions, and some functions with mixed precisions. The distributed-memory parallel algorithm consists of the following major steps. (1) Preprocessing (2) Sparse LU factorization (SpLU) https://github.com/xiaoyeli/superlu_dist https://portal.nersc.gov/project/sparse/superlu/superlu_dist_code_html/index.html Authors’ addresses: Xiaoye S. Li, xsli@lbl.gov; Paul Lin, paullin@lbl.gov; Yang Liu, liuyangzhuan@lbl.gov, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, CA, USA, 94720; Piyush Sao, saopk@ornl.gov, Oak Ridge National Laboratory, Oak Ridge, TN, USA, 37830. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for proit or commercial advantage and that copies bear this notice and the full citation on the irst page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior speciic permission and/or a fee. Request permissions from permissions@acm.org. © 2022 Association for Computing Machinery. 0098-3500/2022/12-ART $15.00 https://doi.org/10.1145/3577197 ACM Trans. Math. Softw. 2 • Xiaoye S. Li, Paul Lin, Yang Liu, and Piyush Sao (3) Sparse triangular solutions (SpTRSV) (4) Iterative reinement (IR) (optional) The preprocessing in Step (1) transforms the original linear system �� = � into�� = �, so that the latter one has more favorable numerical properties and sparsity structures.SuperLU_DIST In , typically � is irst transformed into� = � � � �� � . Here � and � are diagonal scaling matrices to equilibrate the system, which tends to � � � � � � reduce condition number of the matrix and avoid over/underlow during Gaussian elimination. � and � are � � permutation matrices . The role of� is to permute rows of the matrix to make diagonal elements large relative to the of-diagonal elements (numerical pivoting). The role � is ofto permute rows and columns of the matrix to minimize the ill-in in�the and � factors (sparsity reordering). Note that we apply � symmetrically so that the large diagonal entries remain on the diagonal. With these transformations, the linear system to be solved is: � −1 (� � � �� � )(� � )� = � � � �. In the software coniguration, each transformation can be turned of, or � � � � � � � � � � can be achieved with diferent algorithms. Further algorithm details and user interfaces can be found 6, 8].in [ After these transformations, the last preprocessing step is symbolic factorization which computes the distributed nonzero structures of the� and � factors, and distributes the nonzeros of � into� and � . In Step (2), before the new Version-7 release (2021), the distributed memory code had been largely built upon the design in the irstSuperLU_DIST paper [7]. The main ingredients of the parallel SpLU algorithm are: • supernodal fan-out (right-looking) based on elimination DAGs, • static pivoting with possible half-precision perturbations on the diagonal [7], • 2D logical process arrangement for non-uniform block-cyclic mapping, based on the supernodal block partition, and • loosely synchronous scheduling with lookahead pipelining [14]. In Step (3), The parallel SpTRSV uses a block-cyclic layout for� the and � matrices as in the results of SpLU. It also uses a message-driven asynchronous and dynamically schedule algorithmÐdesigne d d to reduce the communication and latency costs. In Step (4), the user can optionally invoke a few steps of iterative reinement to improve the solution accuracy. The computational kernels in the IR are SpTRSV and sparse matrix-vector multiplication (SpMV). Before this release, the iterative reinement is performed in the same precision as that of SpLU and SpTRSV. This release paper focuses on the new capabilities in Steps (2)-(4). These includes the new 3D communication- avoiding algorithm framework (Section 2) , multi-GPU support (Section 3), mixed precision routines (Section 4), and support for new build tools (Appendix C). Throughout the paper, we discuss all the parameters that may inluence the code performance. These parameters can be set in a compile-time "options" structure, or by environment variables (with capitalized names), the latter of which take precedence. Section 5 gives a summary of the parameters and some tuning results. In particular, we illustrate that the performance can be greatly improved by using an autotuner GPTune 9][for an optimal parameters setting. Finally, Section 6 summarizes our contributions and gives perspectives of future developments. 2 3D COMMUNICATION-AVOIDING ROUTINES We developed a novel 3D algorithm framework for sparse factorization and triangular solutions. This new approach is motivated by the strong scaling requirement from exascale applications. Our novel 3D algorithm framework for sparse factorization and triangular solutions alleviates communication costs by taking advantage of the three-dimensional MPI process grid, the elimination tree parallelism, and the communication-memory tradeofÐinspired from communication-avoiding algorithms for dense linear algebra in the last decade. The 3D processes grid, conigured as � = � × � × � (see Fig. 3a), can be considered as � sets of 2D � � � � processes layers. The distribution of the sparse matrices is governed by the supernodal elimination tree-forest (etree-forest): the standard etree is transformed into an etree-forest which is binary atlog the(top � ) levels and ACM Trans. Math. Softw. SuperLU_DIST Release version-8 • 3 has � subtree-forests at the leaf level (see Fig. 1a). The description of the tree partition and mapping algorithm is described in 12, [Section 3.3]. The matrices � , �, and � corresponding to each subtree-forest are assigned to one 2D process layer. The 2D layers are referred to as Grid-0, Grid-1, . . ., up to (� − 1) grids. Fig. 1b shows the submatrix mapping to the four 2D process grids. (a) 2-level etree partition (b) Matrix view on 4 process grids A 0 0 0 0 0 0 0 0 0 A 1 2 1 1 1 2 2 3 3 A 4 6 A 4 5 6 3 A 5 4 5 Grid-0 Grid-1 Grid-2 Grid-3 Fig. 1. Illustration of the 3D parallel SpLU algorithm with 4 process grids. Note that,�herrefers e to�[�:, �:] typedef struct { MPI_Comm comm; /* MPI communicator */ superlu_scope_t rscp; /* row scope */ superlu_scope_t cscp; /* column scope */ superlu_scope_t zscp; /* scope in third dimension */ gridinfo_t grid2d; /* for using 2D functions */ int iam; /* my process number in this grid */ int nprow; /* number of process rows */ int npcol; /* number of process columns */ int npdep; /* number of replication factor in Z-dimension */ int rankorder; /* = 0: Z-major ( default ) * = 1: XY-major (need set environment variable: SUPERLU_RANKORDER=XY) */ } gridinfo3d_t; Fig. 2. 3D process grid definition. An example for calling the 3D algorithm to solve a sparse linear system is provided by the sample program EXAMPLE/pddrive3d.c, see https://github.com/xiaoyeli/superlu_dist/blob/master/EXAMPLE/pddrive3d.c. Fig. 2 shows the C structure deining the 3D process grid. 2.1 The 3D Process layout and its performance impact A 3D process grid can be arranged in two formats: �� -major or � -major, see Fig. 3. In�� -major format, processes with the same�� -coordinate and diferent � -coordinate have consecutive global ranks. Consequently, when spawning multiple processes on a node, the spawned processes will have �the � same coordinate (except for cases where � is not a multiple of the number of processes spawned on the node). Alternatively, we can arrange the 3D process grid in Z-major format where processes with the same � coordinate have consecutive global ranks. This is the default ordering SuperLU_DIST in . ACM Trans. Math. Softw. 4 • Xiaoye S. Li, Paul Lin, Yang Liu, and Piyush Sao The � -major format can be better for performance as it keeps processes in a 2D grid closer. Hence it may provide higher bandwidth for 2D communication, typically the bottleneck in communication. On the other hand, the �� -major format can be helpful when using GPU acceleration. This can happen since the XY-major ordering will keep more GPUs active during ancestor factorizations. In some cases, e.g., sparse matrices from non-planar graphs, ancestor factorization can become compute dominant, and XY-major ordering helps by keeping more GPUs active. For example, on 16 Haswell nodes of the NERSC Cori Cray XC40, the Z-major ordering was×0.85-1.3 faster than the XY major ordering. Haswell compute nodes have dual-socket 16-core 2.3 GHz Intel Xeon E5-2698v3 CPUs. Note that this performance diference is system-dependent, depending on the hardware topology as well as the job scheduler policy of the parallel machine. (a) 3D process grid (b) łZž and łXYž major ordering 0 1 2 0 2 4 9 10 11 1 3 5 3 4 5 6 4 5 12 13 14 7 9 11 y 6 7 8 12 7 8 15 16 17 13 15 17 Z-major Grid XY-major Grid Fig. 3. A logical 3D process grid and process configuration for two types of process arrangements. The linear solver driver routine pdgssvx3d is , with the calling API explained here: https://portal.nersc.gov/ project/sparse/superlu/superlu_dist_code_html/pdgssvx3d_8c.html The SpLU factorization progresses from leaf�le = log vel� to the root level 0. The two main phases are local factorization and Ancestor-Reduction. (1) Local factorization.In parallel and independently, every 2D process grid performs � factorization the 2 of its locally owned submatrix � . This of is the same algorithm as the one before Version-7 14]. The [ only diference is that each process grid will generate a partial Schur complement update, which will be summed up with the partial updates from the other process grids in the next phase. (2) Ancestor-Reduction.After the factorization of le �, w vel- e reduce the partial Schur complement of the ancestor �−�+1 nodes before factorizing the next level.�In -ththe level’s reduction, the receiver is �2 the -th process �−� grid and the sender is the(2�+ 1)2 -th process grid, for some integer �. The process in the 2D grid which owns a block� has the same (�,�) coordinate in both sender and receiver grids. So communication in the �,� ancestor-reduction step is point-to-point pair-wise and takes places along �-axisthe in the 3D process grid. We analyzed the asymptotic improvements for planar graphs (e.g., those arising from 2D grid or mesh discretizations) and certain non-planar graphs (speciically for 3D grids and meshes). For a planar graph with � vertices, our algorithm reduces communication volume asymptotically � by a factor in ofO log� and latency by a factor ofO (log�). For non-planar cases, our algorithm can reduce the per-process communication volume by× 3 and latency byO � times. In all cases, the extra memory needed to achieve these gains is a small constant factor of �the and � memory. We implemented our algorithm by extending the 2D data structure ACM Trans. Math. Softw. SuperLU_DIST Release version-8 • 5 used inSuperLU. Our new 3D code achieves empirical speedups up×tofor 27 planar graphs and up to 3.3× for non-planar graphs over the baselineSuperLU 2D when run on 24,000 cores of a Cray XC30 (Edison at NERSC). Please see [12] for comprehensive performance tests with a variety of real-world sparse matrices. Remark. The algorithm structure requires that the z-dimension of the 3D process � grid must be a power-of- two integer. There is no restriction on the shape of the 2D�grid and � . The rule of thumb is to deine it as � � square as possible. When square grid is not possible, it is better to set the row dimension � slightly smaller than the column dimension � . For example, the following are good options for the 2D grid: 2x3, 2x4, 4x4, 4x8. Inter-grid Load-balancing in the 3D SpLU Algorithm. The 3D algorithm provides two strategies for partitioning the elimination tree to balance the load between diferent 2D grids. SUPERLU_LBS The environment variable speciies which one to use. • Nested Dissection (ND) strategy uses the partitioning provided by a nested dissection ordering. It works well for regular grids. The ND strategy can only be used when the elimination tree is binary, i.e., when the column order is also ND, and it cannot handle cases where the separator tree has nodes with more than two children. • Greedy Heuristic (GD) strategy uses a greedy algorithm to divide one level of the elimination tree. It seeks to minimize the maximum load imbalance among the children of that node; if the imbalance in children is higher than 20%, it further subdivides the largest child until the imbalance falls below 20%. The GD strategy works well for arbitrary column ordering and can handle irregular graphs; however, if it is used on heavily imbalanced trees, it leads to bigger ancestor sizes and, therefore, more memory than ND. GD strategy is the default strategy unless SUPERLU_LBS=ND is speciied. In summary, two parameters are speciic to the 3D SpLU algorithm: • superlu_rankorder (SUPERLU_RANKORDER) deines the arrangement of the 3D process grid (default is Z-major); • superlu_lbs (SUPERLU_LBS) deines the inter-grid load-balancing strategy (default is GD). 3 GPU-ENABLED ROUTINES In the current release, the SpLU factorization routines can oload certain computations to GPUs, which is mostly in each Schur complement update step. We support both NVIDIA and AMD GPUs. We are actively developing code for the Intel GPUs. To enable GPU oloading, irst a compile-time CMake variable needs to be deined: -DTPL_ENABLE_CUDALIB=TRUE (for NVIDIA GPU with CUDA programming)-DTPL_ENABLE_HIPLIB=TRUE or (for AMD GPU with HIP programming). Then, a runtime environment variable SUPERLU_ACC_OFFLOAD is used to control whether to use GPU or not. By default, SUPERLU_ACC_OFFLOAD=1 is set. (‘ACC’ denotes ACCelerator.) 3.1 2D SpLU GPU algorithm and tuning parameters The irst sparse LU factorization algorithm capable of oloading the matrix-matrix multiplication to the GPU was published in 11].[The panel factorization and the Gather/Scatter operations are performed on the CPU. This algorithm has been availableSuperLU_DIST since version 4.0 of the code (October 2014); however, many users are uncertain about using it correctly due to limited documentation. This section provides a gentle introduction to GPU acceleration SuperLU_DIST in and its performance tuning. Performing the Schur complement update requires some temporary storage to hold dense blocks. In an earlier algorithm, at each elimination step, the Schur complement update is performed block by block. After performing updates on a block, the temporary storage can be reused for the next block. A conspicuous advantage of this approach is its memory eiciency, since the temporary storage required is bounded by maximum block size. The maximum block size is a tunable parameter that trades of local performance of matrix-matrix multiplication ACM Trans. Math. Softw. 6 • Xiaoye S. Li, Paul Lin, Yang Liu, and Piyush Sao (GEMM) with inter-process parallelism. A typical setting for the maximum block size is 512 (or smaller). However, a noticeable disadvantage of this approach is that it fails to fully utilize the abundance of local ine-grained parallelism provided by GPUs because each GEMM is too small. In [11], we modiied the algorithm in the Schur complement update step. At each �, w step e irst copy the individual blocks (in skyline storage) �th blo in ck therow of � into a consecutive bufer U(�, :). The �(:, �) is already in consecutive storage thanks to the supernodal structure. We then perform a single GEMM call to compute � ← �(:, �) × � (�, :). The matrix� is preallocated and the size � ne ofeds to be suiciently large to achieve close to peak GEMM performance. If the size �(:of , �) × � (�, :) is larger than � , then we partition the product into several large chunks such that each chunk requires temporary storage smaller � . Giv thanen that modern GPUs have considerably more memory than earlier generations, this extra memory can enable a much faster runtime. Now, each step of the Schur complement update consists of the following substeps: (1) Gather sparse blocks� (�, :) into a dense BLAS compliant bufer U(�, :); (2) Call dense GEMM � ← �(:, �) × U(�, :) (leading part on CPU, trailing part on GPU); and (3) Scatter � [] into the remaining (�+1 : �, �+1 : � ) sparse � and � blocks. It should be noted that the Scatter operation can require indirect memory access, and therefore, it can be as expensive as the GEMM cost. The Gather operation, however, has a relatively low overhead compared to other steps involved. The GEMM oload algorithm tries to hide the overhead of Scatter and data transfer between the CPU and GPU via software pipelining. Here, we discuss the key algorithmic aspects of the GEMM oload algorithm: • To keep both the CPU and GPU busy, we divide the U(�, :) into a CPU part and GPU part, so that the GEMM call is split into [ cpu : gpu�](parts: :, �) × U(�,[���]) and �(:, �) × U(�,[���]). To hide the data transfer cost, the algorithm further divides the GEMM into multiple streams. Each stream performs its own sequence of operations: CPU-to-GPU transfer, GEMM, and GPU-to-CPU transfer. Between these streams, these operations are asynchronous. The GPU matrix multiplication is also pipelined with the Scatter operation performed on the CPU. • To ofset the memory limitation on the GPU, we devised an algorithm to divide the Schur complement update into smaller chunks[as ��{� : ���] | [��� : ���] | . . . }. These chunks depend on the available 1 2 memory on the GPU and can be sized by the user. A smaller chunk size will result in many iterations of the loop. There are three environment variables that can be used to control the memory usage and performance in the GEMM oload algorithm: • superlu_n_gemm (SUPERLU_N_GEMM) is the minimum value of the pro�duct �� for a GEMM call to be worth oloading to the GPU (default is 5000); • superlu_num_gpu_streams (SUPERLU_NUM_GPU_STREAMS) deines the number of GPU streams to use (de- fault is 8); and • superlu_max_buffer_size (SUPERLU_MAX_BUFFER_SIZE) deines the maximum bufer size on GPU that can hold the GEMM output matrix � (default is 256M in loating-point words). This simple GEMM oload algorithm has limited performance gains. We observed a×rsp oughly eedup 2-3 over the CPU-only code for a range of sparse matrices. 3.2 3D SpLU GPU algorithm and tuning parameters We extend the 3D algorithm for heterogeneous architectures by adding the Highly Asynchronous Lazy Oload (Halo) algorithm for co-processor oload 10].[ Compared to the GPU algorithm in the 2D code ( Section 3.1), ACM Trans. Math. Softw. SuperLU_DIST Release version-8 • 7 this algorithm also oloads the Scatter operations of each Schur complement update step to the GPU (in addition to the GEMM call). On 4096 nodes of a Cray XK7 (Titan at ORNL) with 32,768 CPU cores and 4096 Nvidia K20x GPUs, the 3D algorithm achieves empirical speedups up × for to 24 planar graphs and 3.5× for non-planar graphs over the baseline 2D SuperLU with co-processor acceleration. The performance related parameters are: • superlu_num_lookaheads (SUPERLU_NUM_LOOKAHEADS), number of lookahead levels in the Schur-complement update (default is 10). In order to reduce the critical path of the sequence of panel factorizations, we softwar deviseedpip a elining method to overlap the panel factorization of the processes at � +step 1 with the Schur complement update of the other processes at step�. When there are multiple remaining supernodes in the Schur complement, the lookahead window (i.e., pipeline depth) can be greater14 than ]. This 1 [ environment variable deines the width of the lookahead window. • superlu_mpi_process_per_gpu (MPI_PROCESS_PER_GPU) (default is 1). The Halo algorithm uses GPU memory based on its availability. To do this correctly, the algorithm needs to know how many MPI processes are running on a GPU, which can be diicult to determine on some systems. This environment variable can be set to inform SuperLU_DIST that there are N ranks on each GPU so that it can limit its memory usage of each GPU to 90% of available memory shared among all MPI processes, which will, in turn, limit the amount of memory used by each rank. 3.3 2D SpTRSV GPU algorithm When the 2D grid has one MPI rank, SpTRSV in SuperLU_DIST is parallelized using OpenMP for shared-memory processors and CUDA or HIP for GPUs. Both versions of the implementations are based on an asynchronous level- set traversal algorithm that distributes the computation workload across CPU threads and GPU threads/blo 4]. cks [ The CPU implementation uses OpenMP taskloops and tasks for dynamic scheduling, while the GPU implementa- tion relies on static scheduling. Fig. 4a shows the performance of SpTRSV (L and U solves) on 1 Cori Haswell node with 1 and 8 OpenMP threads with a number of matrices. The speedup ranges between 1.4 and 4.3. Fig. 4b shows the performance of L-solve using SuperLU_DIST (8 ORNL Summit IBM POWER9 CPU cores or 1 Summit V100 GPU) and cuSPARSE (1 Summit V100 GPU). The GPU SpTRSV SuperLU_DIST in consistently outperforms cuSPARSE and is comparable to the 8-core CPU results. Here we choose 8 CPU cores as there are on average 7 CPU cores per GPU on Summit, and 8 is the closest power of 2 number. Note that GPU performance of the U-solve requires major improvements and is not available in the current release. That said, we compare the performance of SpTRSV (both L and U solves) on one Summit node using three conigurations: 1. (baseline) 1-core L solve and 1-core U solve, 2. (GPU) 1-GPU L solve and 1-core U solve, and 3. (GPU+OpenMP) 1-GPU L solve and 8-core U solve. The speedups comparing to the baseline coniguration are shown in Table 1. When the 2D grid has more than 1 MPI rank, SpTRSV also supports OpenMP parallelism with less speedups. In addition, the multi-GPU SpTRSV SuperLU_DIST in is under active development and will be available in future releases. The number of OpenMP threads can be controlled by the environment variable OMP_NUM_THREADS, and the GPU SpTRSV can be turned on with the -DGPU_SOLVE compiler lag. The user needs to make sure that only 1 MPI rank is used for the 2D grid when GPU SpTRSV is employed. 4 MIXED-PRECISION ROUTINES SuperLU_DIST has long supported four distinct loating-point types: IEEE FP32 real and complex, IEEE FP64 real and complex. Furthermore, the library allows all four datatypes to be used together in the same application. ACM Trans. Math. Softw. 8 • Xiaoye S. Li, Paul Lin, Yang Liu, and Piyush Sao (a) L and U solve (in Mflops) with 1 and 8 OpenMP threads on Cori Haswell (b) L solve ofSuperLU and cuSparse on CPU and GPU 2.5 CUSPARSE 1 OpenMP thread 8 OpenMP threads GPU SuperLU CPU(8-core) SuperLU 1.5 0.5 Fig. 4. Performance of SpTRSV with 1 MPI rank for a variety of sparse matrices. copter2 epb3 gridgenavanbody shipsec1dawson5 GPU vs. Baseline 1.6 1.7 1.6 1.6 1.54 1.6 GPU+OpenMP vs. Baseline 5.3 5.7 5.3 4.4 4.1 5.2 Table 1. Speedup of GPU SpTRSV compared with sequential CPU SpTRSV. Recent hardware trends have motivated increased developmentmixe of d-precision numerical libraries, mainly because hardware vendors have started designing special-purpose units for low precision arithmetic with higher speed. For direct linear solvers, a well understood method is to use lower precision to perform factorization (expensive) and higher precision to perform iterative reinement (IR) to recover accuracy (cheap). For a typical sparse matrix resulting from the 3D inite diference discretization of a regular mesh, theOSpLU � lops needs 4/3 while each IR step needs only O � lops (including SpTRSV and SpMV). For dense LU and QR factorizations, the beneit of lower precision format comes mainly from accelerated GEMM speed. But in the sparse case, the dimensions of the GEMM are generally smaller and of non-uniform size throughout factorization. Therefore, the speed gain from GEMM alone is limited. In addition to GEMM, a nontrivial cost is the Scatter operation. From our tests, we found that the fraction of time in GEMM usually is less than 50%. Because of this, the TensorCore version of GEMM calls led to a less than 5% speedup for the whole SpLU. When comparing FP32 with the FP64 versions of SpLU, we observed about 50% speedup with the FP32 version. We recall the IR algorithm using three precisions in Algorithm 2, 3]. This1algorithm [ is already available as xGERFSX functions in LAPACK, where the input matrix is dense and so is LU. The following three precisions may be used: • � is the working precision; it is used for the input � anddata �, and output �. (�) 2 • � is the precision for the computed solution � . We require� ≤ � , possibly � ≤ � if necessary for � � � � componentwise convergence. (�) 2 • � is the precision for the residuals � . We usually hav�e≤ � , i.e., at least twice the working precision. � � ACM Trans. Math. Softw. globalmat118_1536 matrix05 copter2 epb3 gridgena vanbody shipsec1 dawson5 gas_sensor rajat16 Gflops matrix05 copter2 gas_sensor xenon2 shipsec1 xenon1 g7jac160 g7jac140sc mark3jac100sc ct20stif vanbody dawson5 gridgena epb3 torso2 torsion1 rajat16 hcircuit SuperLU_DIST Release version-8 • 9 (a) audikw_1 convergence history (b) Ga19As19H2 convergence history Fig. 5. Convergence history of Algorithm 1 when applied to two sparse linear systems. The vertical line in each plot −24 corresponds to the IR steps taken when our stopping criteria are satisfied. Here, the working precision � = is 2 . The blue −53 lines are all single. The red lines correspond to� = � = 2 , but factorization is in single. � � Algorithm 1 Three-precisions Iterative Reinement (IR) for Direct Linear Solvers (1) 1: Solve�� = � using the basic solution method (e.g., LU or QR) ⊲ (� ) 2: �= 1 3: repeat (�) (�) 4: � ← �− �� ⊲ (� ) (�+1) (�) 5: Solve� �� = � using the basic solution method ⊲ (� ) (�+1) (�) (�+1) 6: Update � ← � + �� ⊲ (� ) 7: �← �+ 1 (�) 8: until � is łaccurate enoughž (�) 9: return � and error bounds Algorithm 1 converges with small normwise (or componentwise) error and error bound if the normwise (or def componentwise) condition number�of does not exceed 1/(�� ), where � = ��� (���(�(�,:))). Moreover, � � this IR procedure can return to the user both normwise and componentwise reliable error bounds. The error analysis in [2] should carry through to the sparse cases. We implemented AlgorithmSuperLU_DIST 1 in , using two precisions in IR: −24 −53 • � = 2 (IEEE-754 single precision), � = � = 2 (IEEE-754 double precision) � � � In Figure 5, the left two plots show the convergence history of two systems, in both normwise forward and backward errors, � and � , respectively (deined below). We perform two experiments: one using single ��� ��� precision IR, the other using double precision IR. As can be seen, single precision IR does not �redu,ce much ��� while double precision ��delivers � close to� . The IR time is usually under 10% of the factorization time. ��� � Overall, the mixed-precision speed is still faster than using pure FP64, see Table 2. The 2D driver routine for this mixed-precision appr psgssvx_d2 oach is , where the suix "d2" denotes that the intermediate � vector and � vector internal to the IR routine are carried in double precision. ACM Trans. Math. Softw. 10 • Xiaoye S. Li, Paul Lin, Yang Liu, and Piyush Sao Table 2. Parallel solution time (seconds) (including SpLU and IR): purely double precision, purely single precision, and mixed precision (FP32 SpLU + FP64 IR). ORNL Summit using up to 8 nodes, each node uses 6 CPU Cores (C) and 6 GPUs (G). Matrix Precision6 C+G 24 C+G 48 C+G Matrix Precision6 C+G 24 C+G 48 C+G audikw_1 Double 65.9 21.1 18.9 Ga19As19H42 Double 310.9 62.4 34.3 Single 45.8 13.8 10.5 Single 258.1 48.2 25.8 Mixed 49.2 13.9 11.4 Mixed 262.8 48.8 26.1 The only diference from the one-precision routine psgssvx is the output arrayerr_bounds[] (error bounds). For each right-hand side, we return the following three quantities: (�+1) (�) (�) ∥�� ∥ ∞/∥� ∥ ∥� −�∥ ∞ ∞ • err_bounds[0]: normwise forward error bound: � = max ,�� ≈ ���� � 1−� ∥�∥ ��� ∞ (�+1) def ∥�� ∥ ∞ (�) where, � = max is the estimate of the convergence rate�of. max �≤� (�) ∥�� ∥ (�) −1 (�) � −� ∥� �� ∥ � ∞ � • err_bounds[1]: componentwise forward error bound: max ,�� ≈ max � � 1−�ˆ � ��� � (�+1) ∥��� ∥ ∞ −1 (�) where, � = ����(�), �ˆ = max is the estimate of the convergence rate�of� max �≤� (�) ∥��� ∥ (�) |�−�� | • err_bounds[2]: componentwise backward error: max (�) (|�||� |+|�|) 5 SUMMARY OF TUNING PARAMETERS AND TUNING RESULTS Throughout all phases of the solution process, a number of algorithm parameters can inluence the solver’s performance. These parameters can be modiied by the user. For each user-callable routine, the irst argument is usually an input options " " argument, which points to the structure containing a number of algorithm choices. These choices are determined at compile time. The second column in Table 3 lists the namedoptions ields in the argument. The fourth column lists all the possible values and their corresponding C’s enumerated constant names. The user should call the following routine to set up the default values. superlu_dist_options_t options; set_default_options_dist(&options); After setting the defaults, the user can modify each default, for example: options.RowPerm = LargeDiag_HWPM; For a subset of these parameters, the user can change them at runtime via environment variables. These parameters are listed in the third column in Table 3. At various places of the code, an environment inquiry function SRC/sp_ienv.c is called to retrieve the values of the environment variables. Two algorithm blocking parameters can be changed at runtime: SUPERLU_MAXSUP and SUPERLU_RELAX. SUPERLU_MAXSUP sets the maximum size of a supernode. That is, if the number of columns in a supernode exceeds this value, we will split this supernode into two supernodes. Setting this parameter to a large value results in larger blocks and generally better performance for threaded and GPU GEMM. Increasing it limits the number of available parallel tasks across MPI processes. Figure 6a illustrates how performance, as measured in Glops, varies with SUPERLU_MAXSUP on a single node of Cori Haswell when using 32 OpenMP threads. For smaller matrices, such as this one (torso3), performance is near its peak when SUPERLU_MAXSUP equals 128, which is over×50 faster than when this value is set to 4. However, above this value, the performance starts to taper of. SUPERLU_RELAX is a relaxation parameter: if the number of nodes (columns) in a subtree of the elimination tree is less than this value, this subtree is treated as one supernode, regardless of the row structures. That means, we pad explicit zeros to enforce that all the columns within this relaxed supernode have the same row structure. The advantage of this padding is to mitigate many small supernodes at the bottom of the elimination tree. On ACM Trans. Math. Softw. SuperLU_DIST Release version-8 • 11 (a) (b) torso3-OMP_NUM_THREADS=32,Intel-Haswell Mem(L + U) Mem(IDX) Factorization Time 2 108 102 11 100 100 2 60 2 8 2 4 8 16 32 64 128 256 512 4 8 16 32 64 128 256 Max Supernode Size SUPERLU_RELAX Fig. 6. Impact of maximum supernode size (SUPERLU_MAXSUP) and supernodal relaxation (SUPERLU_RELAX) on perfor- mance and memory. The machine is NERSC Cori Haswell node. The matrix is torso3 from SuiteSparse. the other hand, a large value SUPERLU_RELAX of may introduce too many zeros which in turn propagate to the ancestors of the elimination tree, resulting in a large number of ill-ins in the L and U factors. Figure 6b shows the impact of this parameter on the memory use (left axis) and factorization time. A value of 32 or 64 represents a good tradeof between memory and time. The optimal settings of these parameters are matrix-dependent and hardware-dependent. Additionally, several other parameters and environment variables listed in Table 3 are performance critical for the 2D and 3D, CPU and GPU algorithms described in Sections 2, 3.1 and 3.2. It is a daunting task for manual tuning to ind the optimal setting of these parameters. Now in Sections 5.1 - 5.3 we show how an autotuner can signiicantly simplify this task. Here we leverage an autotuner called GPTune 9] to [ tune the performance (time and memory) of SpLU. We consider two example matrices from the Suitesparse matrix collection, G3_circuit from circuit simulation and H2O from quantum chemistry simulation. For all the experiments, we consider a two-objective tuning scenario and generate a Pareto front from the samples demonstrating the tradeof between memory and CPU requirement of SpLU. 5.1 3D CPU SpLU parameter tuning For the 3D CPU SpLU algorithm (2), we use 16 NERSC Cori Haswell nodes and the G3_circuit matrix. The number of OpenMP threads is set to 1, so there are a total�of� � = 512 MPI ranks. We consider the following tuning � � � parameters [ SUPERLU_MAXSUP, SUPERLU_RELAX, num_lookaheads,� , � ]. We set up GPTune to generate 100 � � samples. All samples and the Pareto front are plotted in Fig. 7a. The samples on the Pareto front and the default one are shown in Table 4, one can clearly see that by reducing the computation granularity SUPERLU_MAXSUP ( , SUPERLU_RELAX) and increasing � , one can signiicantly improve the SpLU time while using slightly more memory. 5.2 2D GPU SpLU parameter tuning For the 2D GPU SpLU algorithm (3.1), we use 2 NERSC Cray EX Perlmutter GPU compute nodes with 4 MPI ranks per node and the H2O matrix. Perlmutter GPU compute nodes consist of a single 64-core 2.45 GHz AMD EPYC 7763 CPU and four NVIDIA A100 (40GB HBM2) GPUs. The number of OpenMP threads is set to 16. We consider the following tuning parameters [ ColPerm, SUPERLU_MAXSUP, SUPERLU_RELAX, SUPERLU_N_GEMM , SUPERLU_MAX_BUFFER_SIZE, � ]. We set up GPTune to generate 100 samples. All samples and the Pareto front ACM Trans. Math. Softw. Performance in GFLOP/sec % Memory relative to SUPERLU_RELAX=4 Factorization time in seconds 12 • Xiaoye S. Li, Paul Lin, Yang Liu, and Piyush Sao Table 3. List of algorithm parameters used in various phases of the linear solver. The third column lists the environment variables that can be reset at runtime. parameters must be set inoptions{} the structure input to a driver routine. phase options env variables values in 2D or 3D algo. (compile-time) (runtime) (enum constants) Pre- Equil NO, YES (default) 2d, 3d process RowPerm 0: NOROWPERM 2d, 3d 1: LargeDiag_MC64 (default) 2d, 3d 2: LargeDiag_HWPM 2d, 3d 3: MY_PERMR 2d, 3d ColPerm 0: NATURAL 2d, 3d 1: MMD_ATA 2d, 3d 2: MMD_AT_PLUS_A 2d, 3d 3: COLAMD 2d, 3d 4: METIS_AT_PLUS_A (default) 2d, 3d 5: PARMETIS 2d, 3d 6: ZOLTAN 2d, 3d 7: MY_PERMC 2d, 3d ParSymbFact YES, NO (default) 2d, 3d SpLU ReplaceTinyPivot YES, NO (default) 2d, 3d Algo3d YES, NO (default) 3d DiagInv YES, NO (default) 2d num_lookaheads SUPERLU_NUM_LOOKAHEADS default 10 2d, 3d (Section 3.2) superlu_maxsup SUPERLU_MAXSUP default 256 2d, 3d (Section 5) superlu_relax SUPERLU_RELAX default 60 2d, 3d superlu_rankorder SUPERLU_RANKORDER default Z-major 3d (Section 2.1) superlu_lbs SUPERLU_LBS default GD 3d (Section 2.1) superlu_acc_oload SUPERLU_ACC_OFFLOAD 0, 1 (default) 2d, 3d (Section 3) superlu_n_gemm SUPERLU_N_GEMM default 5000 2d (Section 3.1) superlu_max_bufer_size SUPERLU_MAX_BUFFER_SIZE default 250M words 2d, 3d (Section 3.1) superlu_num_gpu_streams SUPERLU_NUM_GPU_STREAMS default 8 2d (Section 3.1) superlu_mpi_process_per_gpuSUPERLU_MPI_PROCESS_PER_GPU default 1 3d (Section 3.2) OMP_NUM_THREADS default system dependent 2d, 3d (Section 5.4) OMP_PLACES default system dependent 2d, 3d OMP_PROC_BIND default system dependent 2d, 3d OMP_NESTED default system dependent 2d, 3d OMP_DYNAMIC default system dependent 2d, 3d SpTRSV IterReine 0: NOREFINE (default) 2d, 3d (Section 4) 1: SLU_SINGLE 2: SLU_DOUBLE Others PrintStat NO, YES (default) 2d, 3d are plotted in Fig. 7b. The samples on the Pareto front and the default one are shown in Table 5. Compared to the default coniguration, both the time and memory can be signiicantly improved by increasing the computation granularity (larger SUPERLU_MAXSUP, SUPERLU_RELAX). Also, less GPU oload (larger SUPERLU_N_GEMM) leads to better performance. ACM Trans. Math. Softw. SuperLU_DIST Release version-8 • 13 (a) 3D CPU SpLU (b) 2D GPU SpLU (c) 3D GPU SpLU 4000 4 All samples 10 Pareto optima Default 10000 1.5 All samples Pareto optima All samples Default Pareto optima 4000 Default 0 1 2 3 10 10 10 10 15 20 40 80 10 20 40 Fig. 7. Samples generated by GPTune for the three tuning experiments. Only valid samples are ploted. 5.3 3D GPU SpLU parameter tuning For the 3D GPU SpLU algorithm in Section 3.2, we use 2 NERSC Perlmutter GPU nodes with 4 MPI ranks per node and the H2O matrix. The number of OpenMP threads is set to 16, and � � � = 8. We consider the following � � � tuning parameters [ ColPerm, SUPERLU_MAXSUP, SUPERLU_RELAX, SUPERLU_MAX_BUFFER_SIZE, � , � ]. We set � � up GPTune to generate 200 samples. All samples and the Pareto front are plotted in Fig. 7c. The samples on the Pareto front and the default one are shown in Table 6. Compared to the default coniguration, both the time and memory utilization can be signiicantly improved by increasing the computation granularity and decreasing GPU bufer sizes. ColPerm=‘4’ (METIS_AT_PLUS_A) is always preferable in terms of memory usage. The efe � cts of and � are insigniicant as only 8 MPI ranks are used. SUPERLU_MAXSUP SUPERLU_RELAX num_lookaheads� � Time (s)Memory (MB) � � Default 256 60 10 16 1 5.6 2290 Tuned 31 25 17 16 1 21.9 2253 Tuned 53 35 7 4 4 1.64 2360 Table 4. Default and optimal samples returned by GPTune for the 3D CPU SpLU algorithm. Note �that is derived by � = 512/(� � ), as the total MPI count is fixed at 512. � � � ColPermSUPERLU_MAXSUP SUPERLU_RELAX SUPERLU_N_GEMM SUPERLU_MAX_BUFFER_SIZE � TimeMemory (s) (MB) Default ‘4’ 256 60 1000 2.5E8 4 20.8 6393 Tuned ‘4’ 154 154 2048 2.68E8 2 13.5 6011 Tuned ‘4’ 345 198 262144 6.7E7 2 13.2 6813 Tuned ‘4’ 124 110 8192 1.3E8 2 14.6 5976 Table 5. Default and optimal samples returned by GPTune for the 2D GPU SpLU algorithm. Note �that is derived by � = 8/� , as the total MPI count is fixed at 8. � � ACM Trans. Math. Softw. 14 • Xiaoye S. Li, Paul Lin, Yang Liu, and Piyush Sao ColPermSUPERLU_MAXSUP SUPERLU_RELAX SUPERLU_MAX_BUFFER_SIZE � � Time (s)Memory (MB) � � Default ‘4’ 256 60 2.5E8 4 1 25.3 3520 Tuned ‘4’ 176 143 1.34E8 2 1 12.1 3360 Tuned ‘4’ 327 182 1.34E8 4 2 7.4 3752 Tuned ‘4’ 610 200 3.34E7 8 1 12.5 3280 Tuned ‘4’ 404 187 3.34E7 1 2 8.76 3744 Tuned ‘4’ 232 199 3.34E7 4 2 6.7 3936 Table 6. Default and optimal samples returned by GPTune for the 3D GPU SpLU algorithm. Note�that is calculated from � and � as the total MPI count is fixed at 8. � � 5.4 Tuning of OpenMP Intra-node Parallelism SuperLU_DIST can use shared-memory parallelism on CPUs in two ways. The irst is by using the multithreaded BLAS library for linear-algebraic operations. This is independent of the implementation SuperLU_DIST itself. of The second is throughSuperLU_DIST’s direct use of OpenMP pragmas for explicit parallelization of some computations. OpenMP is portable across a wide variety of CPU architectures and operating systems. OpenMP ofers a shared- memory programming model, which can be easier to use than a message-passing programming model. In this section, we discuss the advantages and limitations of using OpenMP, and ofer some performance considerations. Advantage of OpenMP Parallelism:We have empirically observed that hybrid programming with MPI+OpenMP often requires less memory than pure MPI. This is because OpenMP does not require additional memory for message passing bufers. In most cases, correctly tuned hybrid programming with MPI+OpenMP provides better performance than pure MPI. Limitations of OpenMP Parallelism: Beneits of OpenMP parallelism is often less predictable than pure MPI parallelism because of non-determinism in the threading layer, CPU hardware, and thread ainities. Finding the right coniguration for OpenMP may take some trials and errors because performance depends on many factors: CPU architecture, number of cores and threads, the threading library being used, and the operating system. OpenMP threading may cause a signiicant slowdown if parameters are chosen incorrectly. Slowdown can be due to false-sharing, NUMA efects, hyperthreading, incorrect or suboptimal thread ainities, or underlying threading libraries. OpenMP performance tuning:To get the best performance, we recommend tuning the following OpenMP variables environment variables. OMP_NUM_THREADS governs the number of OpenMP threads thatSuperLU_DIST can use. To avoid resource over-subscription, the product of MPI processes per node and OpenMP threads should be less than or equal to available physical OMP_PLACES cores. deines where the threads may run. Possible values are cores, threads, or socket. OMP_PROC_BIND controls the thread migration. When OMP_PROC_BIND the directive is set to TRUE, OpenMP threads should not be moved; when FALSE they may move between hardware cores and sockets. In general, whenOMP_PLACES is set, the setting OMP_PROC_BIND of should be set to TRUE. The other two less commonly used OpenMP environment variables OMP_NESTED are Ðcontrols levels of nested parallelism,OMP_DYNAMIC and Ðdetermines whether to change the number of threads or thread groups dynamically. Both variables are set to False by default, which works well in most systems. For performance debugging purposes, however, we can explicitly set the two variables. In Fig. 8, we show the impact of diferent OpenMP variables and hybrid MPI-OpenMP conigurations on the SpLU speed on Cori Haswell nodes at NERSC. Figure 8a shows the best performance achieved for diferent OpenMP and NUMA settings variables for purely threaded conigurations. Figure 8b shows the performance for diferent MPI×OpenMP threads on four Haswell nodes of Cori. It should be noted that, hybrid conigurations, i.e. ACM Trans. Math. Softw. SuperLU_DIST Release version-8 • 15 conigurations with more than one OpenMP threads per MPI process, tends to require far less memory for MPI’s internal bufers 11[]. In general, using 2-8 OpenMP threads per MPI process gives good performance across a wide range of matrices. (b) Performance w.r.t MPI× OMP threads on 4 (a) Best performance for diferent OMP variables nodes 2.1 70.9 70.9 70.9 70.970.7 70.9 128 × 1 16 × 8 70 67.9 66.9 2.00 64 × 2 8 × 16 64.6 32 × 4 4 × 32 1.75 56.3 53.9 53.9 1.6 50 1.50 1.3 1.3 1.25 1.2 1.0 1.0 1.0 1.0 1.00 30 0.9 0.9 0.9 0.75 0.50 0.25 NumaAlloc OMP_NUM_THREADS OMP_PLACES OMP_PROC_BIND OMP_DYNAMIC 0.00 nd24k torso3 matrix Fig. 8. OpenMP performance tuning for SpLU on Cori Haswell nodes at NERSC. In Fig. 8a, each bar shows the best performance obtained for a variableśvalue pair by an exhaustive parametric search for the other four variables; the test matrix is torso3 and number of threads is 32. The OpenMP API allows control of these variables programmatically. This becomes useful when the application and SuperLU require diferent OpenMP conigurations. For best performance, the user can use our autotuner GPTune to tune these variables automatically, see Section 5. 6 CONCLUSION In this paper, we presented the recently added features in the distributed-memory sparse direSuperLU_DIST ct solver . They represent signiicant algorithmic advances, including (1) communication-avoiding 3D sparse LU factor- ization, (2) support of multi-GPU oloading, and (3) mixed-precision computations for higher speed and lower memory consumption. Incorporating these algorithmic changes required solving challenging software engineer- ing problems while bringing the research prototype code to the production code that is usable by the users. Furthermore, we documented the parameters that may impact the solver’s performance. The parameters we discussed include those inSuperLU_DIST the library as well as some system’s parameters related to OpenMP. Since the sparse solvers performance are sensitive to both sparse matrix structures and the underlying computer architectures, we show that an autotuner framework, such as GPTune, provides a powerful tool to help choose the best parameters setting. We have plans to incorporate several recent fruitful research results into the future releases of the software. These include (1) communication-avoiding 3D Sp13 TRSV ] and[(2) communication-hiding SpTRSV via one-sided MPI and NVSHMEM direct GPU-to-GPU communication [4, 5]. ACKNOWLEDGMENTS We are grateful to Barry Smith for building the PETSc interface for the new 3D code, for the suggestions to improve the interface of how the parameters should be exposed to the users, and for the detailed feedback of the initial manuscript. This research was supported in part by the Exascale Computing Project (17-SC-20-SC), a collaborative efort of the U.S. Department of Energy Oice of Science and the National Nuclear Security Administration, and in part ACM Trans. Math. Softw. GFLOP/sec Interleave Default 32.0 64.0 cores sockets threads false spread close false true RelativeTime 16 • Xiaoye S. Li, Paul Lin, Yang Liu, and Piyush Sao by the Scientiic Discovery through Advanced Computing (SciDAC) Program under the Oice of Science at the U.S. Department of Energy. REFERENCES [1] A. Azad, A. Buluc, X.S. Li, X. Wang, and J. Langguth. 2020. A Distributed-Memory Algorithm for Computing a Heavy-Weight Perfect Matching on Bipartite Graphs. SIAM J. Scientiic Computing 42, 4 (2020), C143śC168. [2] J. Demmel, Y. Hida, W. Kahan, X.S. Li, S. Mukherjee, and E.J. Riedy. 2006. Error Bounds from Extra-Precise Iterative Reinement. ACM Trans. Math. Softw. 32, 2 (June 2006), 325ś351. [3] J. Demmel, Y. Hida, E.J. Riedy, and X.S. Li. 2009. Extra-precise iterative reinement for overdetermined least squarA esCM problems. Transactions on Mathematical Software (TOMS)35, 4 (2009), 28. [4] Nan Ding, Yang Liu, Samuel Williams, and Xiaoye S.ALi. Message-Driv [n. d.]. en, Multi-GPU Parallel Sparse Triangular Solv.er147ś159. https://doi.org/10.1137/1.9781611976830.14 arXiv:https://epubs.siam.org/doi/pdf/10.1137/1.9781611976830.14 [5] Nan Ding, Samuel Williams, Yang Liu, and Xiaoye S. Le Li. veraging [n. d.]. One-Sided Communication for Sparse Triangular Solv.ers 93ś105. https://doi.org/10.1137/1.9781611976137.9 arXiv:https://epubs.siam.org/doi/pdf/10.1137/1.9781611976137.9 [6] X.S. Li, J.W. Demmel, J.R. Gilbert, L. Grigori, P. Sao, M. Shao, and I. Yamazaki. Sup1999. erLU Users’ Guide. Technical Report LBNL-44289. Lawrence Berkeley National Laboratory. https://portal.nersc.gov/project/sparse/superlu/ug.pdf, Last update: June 2018. [7] X. S. Li and J. W. Demmel. 1998. Making Sparse Gaussian Elimination Scalable by Static ProPiv ceedings oting. of SC98: In High Performance Networking and Computing Conference . Orlando, Florida. [8] X. S. Li and J. W. Demmel. 2003. SuperLU_DIST: A Scalable Distributed-Memory Sparse Direct Solver for Unsymmetric Linear Systems. ACM Trans. Mathematical Software29, 2 (June 2003), 110ś140. [9] Yang Liu, Wissam M. Sid-Lakhdar, Osni Marques, Xinran Zhu, Chang Meng, James W. Demmel, and Xiaoye S. Li. 2021. GPTune: Multitask Learning for Autotuning Exascale Applications. Proceedings In of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming(Virtual Event, Republic of K(PPoPP orea) ’21). Association for Computing Machinery, New York, NY, USA, 234ś246. https://doi.org/10.1145/3437801.3441621 [10] P. Sao, X. Liu, R. Vuduc, and X.S. Li. 2015. A Sparse Direct Solver for Distributed Memory Xeon Phi-accelerated Systems. 29th IEEEIn International Parallel & Distributed Processing Symposium (IPDPS) . Hyderabad, India. [11] P. Sao, R. Vuduc, and X. Li. 2014. A Distributed CPU-GPU Sparse Direct Solv Prero.c.Inof Euro-Par 2014, LNCS Vol. 8632, pp. 487-498. Porto, Portugal. [12] P. Sao, R. Vuduc, and X. Li. 2019. A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneJ. ous systems. Parallel and Distributed Computing (September 2019). https://doi.org/10.1016/j.jpdc.2019.03.004 https://www.sciencedirect.com/science/ article/abs/pii/S0743731518305197. [13] P. Sao, R. Vuduc, and X.S. Li. 2019. A Communication-Avoiding 3D Sparse Triangular Solv ICS 2019: er. InInternational Conference on Supercomputing. Phoenix, AZ. [14] I. Yamazaki and X.S. Li. 2012. New Scheduling Strategies and Hybrid Programming for a Parallel Right-looking Sparse LU Factorization on Multicore Cluster Systems.PrIn oceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS 2012) . Shanghai, China. ACM Trans. Math. Softw. SuperLU_DIST Release version-8 • 17 APPENDIX A NAMING CONVENTION AND CODE DOCUMENTATION The routines in SuperLU_DIST are divided into driver routinesand computational routines . The routine names are inspired by the LAPACK and ScaLAPACK naming convention. For example, the 2D linear solver driver is pdgssvx, where ‘p’ means parallel, ‘d’ means double precision, ‘gs’ means general sparse matrix format, and ‘svx’ means solving a linear system. Below is a list of double precision user-callable routines. • Driver routines: pdgssvx (driver for the old 2D algorithms), pdgssvx3d (driver for the new 3D algorithms in Section 2). • Computational routines: pdgstrf and pdgstrs are respectively triangular factorization SpLU and triangular solve in the 2D process grid. pdgstrf3d is triangular factorization SpLU in the 3D process grid. These routines take apreprocessed linear system as an input. An experienced user can use them directly in the application code as they can provide greater lexibility. For a new user, however, using them can be cumbersome and error-prone. We recommend using driver routines, which are easier to use. • The pddrive and pddrive3d examples in the EXAMPLE/ directory call the respective driv pdgssvx ers and pdgssvx3d to solve linear systems. Other examples in the same directory,pddrive1 such as , pddrive2, etc., illustrate how to reuse the preprocessing results for a sequence of linear systems with similar structures. The Doxygen generated documentation for all the routines is available at https://portal.nersc.gov/project/sparse/ superlu/superlu_dist_code_html/. Each routine begins with a comment that breaks down input/output arguments and explains the functions of the routine. Although the original User’s Guide contains comprehensive description of various internal data structures and algorithms [6], it does not contain the new features presented here. B FORTRAN 90 INTERFACE In the FORTRAN/ directory, there are Fortran 90 module iles that implement the wrappers for Fortran programs to access the full functionality of the C functions SuperLU_DIST in . The design is based on object-oriented programming concept: deine opaque objects in the C space, which are accessed via handles from the Fortran space. All SuperLU_DIST objects (e.g., process grid, LU structure) are opaque from the Fortran side. They are allocated, deallocated and operated at the C side. For each C object, we deine a Fortran handle in Fortran’s user space, which points to the C object and implements the access methods to manipulate the object. All handles are 64-bit integer type. For example, consider creating a 3D process grid. The following code snippet shows what are involved from the Fortran and C sides. • Fortran side /* Declare handle: */ integer(64)::f_grid3d /* Call C wrapper routine to create 3D grid pointed to by "f_grid3d": */ call f_superlu_gridinit3d(MPI_COMM_WORLD, nprow, npcol, npdep, f_grid3d) • C side /* Fortran-to-C interface routine: */ void f_superlu_gridinit3d(int *MPIcomm, int *nprow, int *npcol,int *npdep, int64_t *f_grid3d) /* Actual call to C routine to create grid3d structture in *grid3d{} */ superlu_gridinit3d(f2c_comm(MPIcomm),*nprow, *npcol, *npdep, (gridinfo3d_t *) *f_grid3d); We support four datatypes: ‘s’ (FP32 real), ‘d’ (FP64 double), ‘c’ (FP32 complex) and ‘z’ (FP64 complex). Throughout the paper, we use the ‘d’ version of the routine names. ACM Trans. Math. Softw. 18 • Xiaoye S. Li, Paul Lin, Yang Liu, and Piyush Sao Here, the Fortran handlef_grid3d essentially acts as a 64-bit pointer pointing to the internal 3D grid structure, which is created by the C routine superlu_gridinit3d(). This structure (see Fig. 2) sits in the C space and is invisible from the Fortran side. For all the user-callable C functions, we provide the corresponding Fortran-to-C interface functions, so that the Fortran program can access all the C functionality. These interface routines are implemented in the iles superlu_c2f_wrap.c (precision-independent) and superlu_c2f_dwrap.c (double precision). The Fortran-to- C name mangling is handled by CMake through the headerSRC/superlu_FCnames.h ile . The module ile superlupara.f90 deines all the constants matching the enum constants deined in the C side (see Table 3). The module ile superlu_mod.f90 implements all the access methods (set/get) for the Fortran side to access the objects created in the C user space. C INSTALLATION WITH CMAKE OR SPACK C.1 Dependent external libraries One can have a bare minimum installation SuperLU_DIST of without any external dependencies, although the following external libraries are useful for high performance: BLAS, (Par)METIS (sparsity-preserving ordering), CombBLAS (parallel numerical pivoting) and LAPACK (for inversion of dense diagonal block). C.2 CMake installation The user will need to create a build tree from which to invoke CMake. The following describes how to deine the external libraries. BLAS (highly recommended) If the user has a fast BLAS library on their machine, the user can link it using the following CMake deinition: -DTPL_BLAS_LIBRARIES="<BLAS library name>" Otherwise, theCBLAS/ subdirectory contains the part of the C BLAS (single threaded) neede SuperLU_DIST d by , but it is not optimized for performance. The user can compile and use this internal BLAS with the following CMake deinition: -DTPL_ENABLE_INTERNAL_BLASLIB=ON ParMETIS (highly recommended) http://glaros.dtc.umn.edu/gkhome/fetch/sw/parmetis/parmetis-4.0.3.tar.gz The user can install ParMETIS and deine the two environment variables as follows: export PARMETIS_ROOT=<Prefix directory of the ParMETIS installation> export PARMETIS_BUILD_DIR=${PARMETIS_ROOT}/build/Linux-x86_64 Note that by default, we use serial METIS as the sparsity-preserving ordering, which is available in the ParMETIS package. The user can disable ParMETIS during installation with the following CMake deinition: -DTPL_ENABLE_PARMETISLIB=OFF. In this case, the default ordering is setMMD_AT_PLUS_A to be . See Table 3 for all the possible ColPerm options. In order to use parallel symbolic factorization function, the user needs to use ParMETIS ordering. LAPACK (highly recommended) In the triangular solve routine, we may use LAPACK to explicitly invert the dense diagonal block to improve the performance. The user can use it with the following CMake option: ACM Trans. Math. Softw. SuperLU_DIST Release version-8 • 19 -DTPL_ENABLE_LAPACKLIB=ON CombBLAS (optional) https://people.eecs.berkeley.edu/~aydin/CombBLAS/html/index.html In order to use parallel weighted matching HWPM (Heavy Weight Perfect Matching) for numerical pre- pivoting [1], the user needs to install CombBLAS and deine the environment variables: export COMBBLAS_ROOT=<Prefix directory of the CombBLAS installation> export COMBBLAS_BUILD_DIR=${COMBBLAS_ROOT}/_build Then, install with the CMake option: -DTPL_ENABLE_COMBBLASLIB=ON Use GPU The user can enable (NVIDIA) GPU with CUDA with the following CMake option: -DTPL_ENABLE_CUDALIB=TRUE The user can enable (AMD) GPU with HIP with the following CMake option: -DTPL_ENABLE_HIPLIB=TRUE For a simple installation with default settings: mkdir build ; cd build; cmake .. \ -DTPL_PARMETIS_INCLUDE_DIRS="${PARMETIS_ROOT}/include;\ ${PARMETIS_ROOT}/metis/include" \ -DTPL_PARMETIS_LIBRARIES="${PARMETIS_BUILD_DIR}/libparmetis/libparmetis.a;\ ${PARMETIS_BUILD_DIR}/libmetis/libmetis.a" \ There are a number of example build scripts example_script/ in the directory, with ilenames run_cmake_build_*.sh that target various machines. To actually build (compile), type: ‘make’. To install the libraries, type: ‘make install’. To run the installation tests, type: ‘test’. (The outputs are in ile: ‘build/Testing/Temporary/LastTest.log’) or, ‘ctest -D Experimental’, or, ‘ctest -D Nightly’. Note that the parallel execution ctest in is invoked by the "mpiexec" command, which is from the MPICH environment. If the MPI is not MPICH/mpiexec based, the test execution may fail. The user can pass the deinition option-DMPIEXEC_EXECUTABLE to CMake. For example on Cori at NERSC, the user will need the following: cmake .. -DMPIEXEC_EXECUTABLE=/usr/bin/srun. Or, the user can always go toTEST/ directory to perform testing manually. The following list summarizes the commonly used CMake deinitions. In each case, the irst choice is the default setting. After running a ‘cmake’ installation, a coniguration header ile isSRC/superlu_dist_config.h generated in , which contains the key CPP deinitions used throughout the code. -DTPL_ENABLE_INTERNAL_BLASLIB=OFF | ON -DTPL_ENABLE_PARMETISLIB=ON | OFF -DTPL_ENABLE_LAPACKLIB=OFF | ON -DTPL_ENABLE_COMBBLASLIB=OFF | ON -DTPL_ENABLE_CUDALIB=OFF | ON -DCMAKE_CUDA_FLAGS=<...> -DTPL_ENABLE_HIPLIB=OFF | ON -DHIP_HIPCC_FLAGS=<...> ACM Trans. Math. Softw. 20 • Xiaoye S. Li, Paul Lin, Yang Liu, and Piyush Sao -Denable_complex16=OFF | ON (double-complex datatype) -Denable_single=OFF | ON (single precision real datatype) -DXSDK_INDEX_SIZE=32 | 64 (integer size for indexing) -DBUILD_SHARED_LIBS= OFF | ON -DCMAKE_INSTALL_PREFIX=<...> -DCMAKE_C_COMPILER=<MPI C compiler> -DCMAKE_C_FLAGS=<...> -DCMAKE_CXX_COMPILER=<MPI C++ compiler> -DCMAKE_CXX_FLAGS=<...> -DXSDK_ENABLE_Fortran=OFF | ON -DCMAKE_Fortran_COMPILER=<MPI F90 compiler> C.3 Spack installation Spack installation SuperLU_DIST of is a fully automated process. Assume that the develop branch of Spack (https://github.com/spack/spack) is used. The user can ind available compilers spackvia: compilers. In the following, let’s assume the available compiler is gcc@9.1.0. The installation supports the following variants: Use 64-bit integer The user can enable 64-bit integer with: spack install superlu-dist@master+int64%gcc@9.1.0 Use GPU The user can enable (NVIDIA or AMD) GPUs with: spack install superlu-dist@master+cuda%gcc@9.1.0 spack install superlu-dist@master+rocm%gcc@9.1.0 Test installation The user can run a few smoke tests of the spack installation via spack test run superlu-dist@master (pick the appropriate installation if multiple variants available) ACM Trans. Math. Softw. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png ACM Transactions on Mathematical Software (TOMS) Association for Computing Machinery

Newly Released Capabilities in the Distributed-Memory SuperLU Sparse Direct Solver

Loading next page...
 
/lp/association-for-computing-machinery/newly-released-capabilities-in-the-distributed-memory-superlu-sparse-cd0d6U05Tx

References

References for this paper are not available at this time. We will be adding them shortly, thank you for your patience.

Publisher
Association for Computing Machinery
Copyright
Copyright © 2023 Association for Computing Machinery.
ISSN
0098-3500
eISSN
1557-7295
DOI
10.1145/3577197
Publisher site
See Article on Publisher Site

Abstract

Newly Released Capabilities in Distributed-memory SuperLU Sparse Direct Solver XIAOYE S. LI, PAUL LIN, and YANG LIU, Lawrence Berkeley National Laboratory, USA PIYUSH SAO, Oak Ridge National Laboratory, USA We present the new features available in the recent releaseSuperLU_DIST of , Version 8.1.1.SuperLU_DIST is a distributed- memory parallel sparse direct solver. The new features include (1) a 3D communication-avoiding algorithm framework which trades of inter-process communication for selective memory duplication, (2) multi-GPU support for both NVIDIA GPUs and AMD GPUs, and (3) mixed precision routines that perform single precision LU factorization and double precision iterative reinement. Apart from the algorithm improvements, we also modernized the software build system to use CMake and Spack package installation tools to simplify the installation procedure. Throughout the paper, we describe in detail the pertinent performance-sensitive parameters associated with each new algorithmic feature, show how they are exposed to the users, and give general guidance of how to set these parameters. We illustrate that the solvers performance both in time and memory can be greatly improved after systematic tuning of the parameters, depending on the input sparse matrix and underlying hardware. CCS Concepts: · Mathematics of computing→ Solvers. Additional Key Words and Phrases: Sparse direct solver, communication-avoiding, GPU, mixed-precision 1 OVERVIEW OF SUPERLU_DIST 1,2 SuperLU_DIST is a distributed-memory parallel sparse direct solver library for solving large sets of linear equations�� = � [6]. Here � is a square, nonsingular�,× � sparse matrix, and� and � are dense � × ��ℎ� matrices, where��ℎ�is the number of right-hand sides and solution vectors. The matrix � need not be symmetric or deinite; indeed,SuperLU_DIST is particularly appropriate for unsymmetric matrices, and it respects both the unsymmetric values as well as the unsymmetric sparsity pattern. The library uses variations of Gaussian elimination (LU factorization) optimized to take advantage of the sparsity of the matrix and modern high performance computer architectures (speciically memory hierarchy and parallelism). It is implemented in ANSI C, using MPI for communication, OpenMP for multithreading, and CUDA (or HIP) for NVIDIA (or AMD) GPUs. The library includes routines to handle both real and complex matrices in single and double precisions, and some functions with mixed precisions. The distributed-memory parallel algorithm consists of the following major steps. (1) Preprocessing (2) Sparse LU factorization (SpLU) https://github.com/xiaoyeli/superlu_dist https://portal.nersc.gov/project/sparse/superlu/superlu_dist_code_html/index.html Authors’ addresses: Xiaoye S. Li, xsli@lbl.gov; Paul Lin, paullin@lbl.gov; Yang Liu, liuyangzhuan@lbl.gov, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, CA, USA, 94720; Piyush Sao, saopk@ornl.gov, Oak Ridge National Laboratory, Oak Ridge, TN, USA, 37830. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for proit or commercial advantage and that copies bear this notice and the full citation on the irst page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior speciic permission and/or a fee. Request permissions from permissions@acm.org. © 2022 Association for Computing Machinery. 0098-3500/2022/12-ART $15.00 https://doi.org/10.1145/3577197 ACM Trans. Math. Softw. 2 • Xiaoye S. Li, Paul Lin, Yang Liu, and Piyush Sao (3) Sparse triangular solutions (SpTRSV) (4) Iterative reinement (IR) (optional) The preprocessing in Step (1) transforms the original linear system �� = � into�� = �, so that the latter one has more favorable numerical properties and sparsity structures.SuperLU_DIST In , typically � is irst transformed into� = � � � �� � . Here � and � are diagonal scaling matrices to equilibrate the system, which tends to � � � � � � reduce condition number of the matrix and avoid over/underlow during Gaussian elimination. � and � are � � permutation matrices . The role of� is to permute rows of the matrix to make diagonal elements large relative to the of-diagonal elements (numerical pivoting). The role � is ofto permute rows and columns of the matrix to minimize the ill-in in�the and � factors (sparsity reordering). Note that we apply � symmetrically so that the large diagonal entries remain on the diagonal. With these transformations, the linear system to be solved is: � −1 (� � � �� � )(� � )� = � � � �. In the software coniguration, each transformation can be turned of, or � � � � � � � � � � can be achieved with diferent algorithms. Further algorithm details and user interfaces can be found 6, 8].in [ After these transformations, the last preprocessing step is symbolic factorization which computes the distributed nonzero structures of the� and � factors, and distributes the nonzeros of � into� and � . In Step (2), before the new Version-7 release (2021), the distributed memory code had been largely built upon the design in the irstSuperLU_DIST paper [7]. The main ingredients of the parallel SpLU algorithm are: • supernodal fan-out (right-looking) based on elimination DAGs, • static pivoting with possible half-precision perturbations on the diagonal [7], • 2D logical process arrangement for non-uniform block-cyclic mapping, based on the supernodal block partition, and • loosely synchronous scheduling with lookahead pipelining [14]. In Step (3), The parallel SpTRSV uses a block-cyclic layout for� the and � matrices as in the results of SpLU. It also uses a message-driven asynchronous and dynamically schedule algorithmÐdesigne d d to reduce the communication and latency costs. In Step (4), the user can optionally invoke a few steps of iterative reinement to improve the solution accuracy. The computational kernels in the IR are SpTRSV and sparse matrix-vector multiplication (SpMV). Before this release, the iterative reinement is performed in the same precision as that of SpLU and SpTRSV. This release paper focuses on the new capabilities in Steps (2)-(4). These includes the new 3D communication- avoiding algorithm framework (Section 2) , multi-GPU support (Section 3), mixed precision routines (Section 4), and support for new build tools (Appendix C). Throughout the paper, we discuss all the parameters that may inluence the code performance. These parameters can be set in a compile-time "options" structure, or by environment variables (with capitalized names), the latter of which take precedence. Section 5 gives a summary of the parameters and some tuning results. In particular, we illustrate that the performance can be greatly improved by using an autotuner GPTune 9][for an optimal parameters setting. Finally, Section 6 summarizes our contributions and gives perspectives of future developments. 2 3D COMMUNICATION-AVOIDING ROUTINES We developed a novel 3D algorithm framework for sparse factorization and triangular solutions. This new approach is motivated by the strong scaling requirement from exascale applications. Our novel 3D algorithm framework for sparse factorization and triangular solutions alleviates communication costs by taking advantage of the three-dimensional MPI process grid, the elimination tree parallelism, and the communication-memory tradeofÐinspired from communication-avoiding algorithms for dense linear algebra in the last decade. The 3D processes grid, conigured as � = � × � × � (see Fig. 3a), can be considered as � sets of 2D � � � � processes layers. The distribution of the sparse matrices is governed by the supernodal elimination tree-forest (etree-forest): the standard etree is transformed into an etree-forest which is binary atlog the(top � ) levels and ACM Trans. Math. Softw. SuperLU_DIST Release version-8 • 3 has � subtree-forests at the leaf level (see Fig. 1a). The description of the tree partition and mapping algorithm is described in 12, [Section 3.3]. The matrices � , �, and � corresponding to each subtree-forest are assigned to one 2D process layer. The 2D layers are referred to as Grid-0, Grid-1, . . ., up to (� − 1) grids. Fig. 1b shows the submatrix mapping to the four 2D process grids. (a) 2-level etree partition (b) Matrix view on 4 process grids A 0 0 0 0 0 0 0 0 0 A 1 2 1 1 1 2 2 3 3 A 4 6 A 4 5 6 3 A 5 4 5 Grid-0 Grid-1 Grid-2 Grid-3 Fig. 1. Illustration of the 3D parallel SpLU algorithm with 4 process grids. Note that,�herrefers e to�[�:, �:] typedef struct { MPI_Comm comm; /* MPI communicator */ superlu_scope_t rscp; /* row scope */ superlu_scope_t cscp; /* column scope */ superlu_scope_t zscp; /* scope in third dimension */ gridinfo_t grid2d; /* for using 2D functions */ int iam; /* my process number in this grid */ int nprow; /* number of process rows */ int npcol; /* number of process columns */ int npdep; /* number of replication factor in Z-dimension */ int rankorder; /* = 0: Z-major ( default ) * = 1: XY-major (need set environment variable: SUPERLU_RANKORDER=XY) */ } gridinfo3d_t; Fig. 2. 3D process grid definition. An example for calling the 3D algorithm to solve a sparse linear system is provided by the sample program EXAMPLE/pddrive3d.c, see https://github.com/xiaoyeli/superlu_dist/blob/master/EXAMPLE/pddrive3d.c. Fig. 2 shows the C structure deining the 3D process grid. 2.1 The 3D Process layout and its performance impact A 3D process grid can be arranged in two formats: �� -major or � -major, see Fig. 3. In�� -major format, processes with the same�� -coordinate and diferent � -coordinate have consecutive global ranks. Consequently, when spawning multiple processes on a node, the spawned processes will have �the � same coordinate (except for cases where � is not a multiple of the number of processes spawned on the node). Alternatively, we can arrange the 3D process grid in Z-major format where processes with the same � coordinate have consecutive global ranks. This is the default ordering SuperLU_DIST in . ACM Trans. Math. Softw. 4 • Xiaoye S. Li, Paul Lin, Yang Liu, and Piyush Sao The � -major format can be better for performance as it keeps processes in a 2D grid closer. Hence it may provide higher bandwidth for 2D communication, typically the bottleneck in communication. On the other hand, the �� -major format can be helpful when using GPU acceleration. This can happen since the XY-major ordering will keep more GPUs active during ancestor factorizations. In some cases, e.g., sparse matrices from non-planar graphs, ancestor factorization can become compute dominant, and XY-major ordering helps by keeping more GPUs active. For example, on 16 Haswell nodes of the NERSC Cori Cray XC40, the Z-major ordering was×0.85-1.3 faster than the XY major ordering. Haswell compute nodes have dual-socket 16-core 2.3 GHz Intel Xeon E5-2698v3 CPUs. Note that this performance diference is system-dependent, depending on the hardware topology as well as the job scheduler policy of the parallel machine. (a) 3D process grid (b) łZž and łXYž major ordering 0 1 2 0 2 4 9 10 11 1 3 5 3 4 5 6 4 5 12 13 14 7 9 11 y 6 7 8 12 7 8 15 16 17 13 15 17 Z-major Grid XY-major Grid Fig. 3. A logical 3D process grid and process configuration for two types of process arrangements. The linear solver driver routine pdgssvx3d is , with the calling API explained here: https://portal.nersc.gov/ project/sparse/superlu/superlu_dist_code_html/pdgssvx3d_8c.html The SpLU factorization progresses from leaf�le = log vel� to the root level 0. The two main phases are local factorization and Ancestor-Reduction. (1) Local factorization.In parallel and independently, every 2D process grid performs � factorization the 2 of its locally owned submatrix � . This of is the same algorithm as the one before Version-7 14]. The [ only diference is that each process grid will generate a partial Schur complement update, which will be summed up with the partial updates from the other process grids in the next phase. (2) Ancestor-Reduction.After the factorization of le �, w vel- e reduce the partial Schur complement of the ancestor �−�+1 nodes before factorizing the next level.�In -ththe level’s reduction, the receiver is �2 the -th process �−� grid and the sender is the(2�+ 1)2 -th process grid, for some integer �. The process in the 2D grid which owns a block� has the same (�,�) coordinate in both sender and receiver grids. So communication in the �,� ancestor-reduction step is point-to-point pair-wise and takes places along �-axisthe in the 3D process grid. We analyzed the asymptotic improvements for planar graphs (e.g., those arising from 2D grid or mesh discretizations) and certain non-planar graphs (speciically for 3D grids and meshes). For a planar graph with � vertices, our algorithm reduces communication volume asymptotically � by a factor in ofO log� and latency by a factor ofO (log�). For non-planar cases, our algorithm can reduce the per-process communication volume by× 3 and latency byO � times. In all cases, the extra memory needed to achieve these gains is a small constant factor of �the and � memory. We implemented our algorithm by extending the 2D data structure ACM Trans. Math. Softw. SuperLU_DIST Release version-8 • 5 used inSuperLU. Our new 3D code achieves empirical speedups up×tofor 27 planar graphs and up to 3.3× for non-planar graphs over the baselineSuperLU 2D when run on 24,000 cores of a Cray XC30 (Edison at NERSC). Please see [12] for comprehensive performance tests with a variety of real-world sparse matrices. Remark. The algorithm structure requires that the z-dimension of the 3D process � grid must be a power-of- two integer. There is no restriction on the shape of the 2D�grid and � . The rule of thumb is to deine it as � � square as possible. When square grid is not possible, it is better to set the row dimension � slightly smaller than the column dimension � . For example, the following are good options for the 2D grid: 2x3, 2x4, 4x4, 4x8. Inter-grid Load-balancing in the 3D SpLU Algorithm. The 3D algorithm provides two strategies for partitioning the elimination tree to balance the load between diferent 2D grids. SUPERLU_LBS The environment variable speciies which one to use. • Nested Dissection (ND) strategy uses the partitioning provided by a nested dissection ordering. It works well for regular grids. The ND strategy can only be used when the elimination tree is binary, i.e., when the column order is also ND, and it cannot handle cases where the separator tree has nodes with more than two children. • Greedy Heuristic (GD) strategy uses a greedy algorithm to divide one level of the elimination tree. It seeks to minimize the maximum load imbalance among the children of that node; if the imbalance in children is higher than 20%, it further subdivides the largest child until the imbalance falls below 20%. The GD strategy works well for arbitrary column ordering and can handle irregular graphs; however, if it is used on heavily imbalanced trees, it leads to bigger ancestor sizes and, therefore, more memory than ND. GD strategy is the default strategy unless SUPERLU_LBS=ND is speciied. In summary, two parameters are speciic to the 3D SpLU algorithm: • superlu_rankorder (SUPERLU_RANKORDER) deines the arrangement of the 3D process grid (default is Z-major); • superlu_lbs (SUPERLU_LBS) deines the inter-grid load-balancing strategy (default is GD). 3 GPU-ENABLED ROUTINES In the current release, the SpLU factorization routines can oload certain computations to GPUs, which is mostly in each Schur complement update step. We support both NVIDIA and AMD GPUs. We are actively developing code for the Intel GPUs. To enable GPU oloading, irst a compile-time CMake variable needs to be deined: -DTPL_ENABLE_CUDALIB=TRUE (for NVIDIA GPU with CUDA programming)-DTPL_ENABLE_HIPLIB=TRUE or (for AMD GPU with HIP programming). Then, a runtime environment variable SUPERLU_ACC_OFFLOAD is used to control whether to use GPU or not. By default, SUPERLU_ACC_OFFLOAD=1 is set. (‘ACC’ denotes ACCelerator.) 3.1 2D SpLU GPU algorithm and tuning parameters The irst sparse LU factorization algorithm capable of oloading the matrix-matrix multiplication to the GPU was published in 11].[The panel factorization and the Gather/Scatter operations are performed on the CPU. This algorithm has been availableSuperLU_DIST since version 4.0 of the code (October 2014); however, many users are uncertain about using it correctly due to limited documentation. This section provides a gentle introduction to GPU acceleration SuperLU_DIST in and its performance tuning. Performing the Schur complement update requires some temporary storage to hold dense blocks. In an earlier algorithm, at each elimination step, the Schur complement update is performed block by block. After performing updates on a block, the temporary storage can be reused for the next block. A conspicuous advantage of this approach is its memory eiciency, since the temporary storage required is bounded by maximum block size. The maximum block size is a tunable parameter that trades of local performance of matrix-matrix multiplication ACM Trans. Math. Softw. 6 • Xiaoye S. Li, Paul Lin, Yang Liu, and Piyush Sao (GEMM) with inter-process parallelism. A typical setting for the maximum block size is 512 (or smaller). However, a noticeable disadvantage of this approach is that it fails to fully utilize the abundance of local ine-grained parallelism provided by GPUs because each GEMM is too small. In [11], we modiied the algorithm in the Schur complement update step. At each �, w step e irst copy the individual blocks (in skyline storage) �th blo in ck therow of � into a consecutive bufer U(�, :). The �(:, �) is already in consecutive storage thanks to the supernodal structure. We then perform a single GEMM call to compute � ← �(:, �) × � (�, :). The matrix� is preallocated and the size � ne ofeds to be suiciently large to achieve close to peak GEMM performance. If the size �(:of , �) × � (�, :) is larger than � , then we partition the product into several large chunks such that each chunk requires temporary storage smaller � . Giv thanen that modern GPUs have considerably more memory than earlier generations, this extra memory can enable a much faster runtime. Now, each step of the Schur complement update consists of the following substeps: (1) Gather sparse blocks� (�, :) into a dense BLAS compliant bufer U(�, :); (2) Call dense GEMM � ← �(:, �) × U(�, :) (leading part on CPU, trailing part on GPU); and (3) Scatter � [] into the remaining (�+1 : �, �+1 : � ) sparse � and � blocks. It should be noted that the Scatter operation can require indirect memory access, and therefore, it can be as expensive as the GEMM cost. The Gather operation, however, has a relatively low overhead compared to other steps involved. The GEMM oload algorithm tries to hide the overhead of Scatter and data transfer between the CPU and GPU via software pipelining. Here, we discuss the key algorithmic aspects of the GEMM oload algorithm: • To keep both the CPU and GPU busy, we divide the U(�, :) into a CPU part and GPU part, so that the GEMM call is split into [ cpu : gpu�](parts: :, �) × U(�,[���]) and �(:, �) × U(�,[���]). To hide the data transfer cost, the algorithm further divides the GEMM into multiple streams. Each stream performs its own sequence of operations: CPU-to-GPU transfer, GEMM, and GPU-to-CPU transfer. Between these streams, these operations are asynchronous. The GPU matrix multiplication is also pipelined with the Scatter operation performed on the CPU. • To ofset the memory limitation on the GPU, we devised an algorithm to divide the Schur complement update into smaller chunks[as ��{� : ���] | [��� : ���] | . . . }. These chunks depend on the available 1 2 memory on the GPU and can be sized by the user. A smaller chunk size will result in many iterations of the loop. There are three environment variables that can be used to control the memory usage and performance in the GEMM oload algorithm: • superlu_n_gemm (SUPERLU_N_GEMM) is the minimum value of the pro�duct �� for a GEMM call to be worth oloading to the GPU (default is 5000); • superlu_num_gpu_streams (SUPERLU_NUM_GPU_STREAMS) deines the number of GPU streams to use (de- fault is 8); and • superlu_max_buffer_size (SUPERLU_MAX_BUFFER_SIZE) deines the maximum bufer size on GPU that can hold the GEMM output matrix � (default is 256M in loating-point words). This simple GEMM oload algorithm has limited performance gains. We observed a×rsp oughly eedup 2-3 over the CPU-only code for a range of sparse matrices. 3.2 3D SpLU GPU algorithm and tuning parameters We extend the 3D algorithm for heterogeneous architectures by adding the Highly Asynchronous Lazy Oload (Halo) algorithm for co-processor oload 10].[ Compared to the GPU algorithm in the 2D code ( Section 3.1), ACM Trans. Math. Softw. SuperLU_DIST Release version-8 • 7 this algorithm also oloads the Scatter operations of each Schur complement update step to the GPU (in addition to the GEMM call). On 4096 nodes of a Cray XK7 (Titan at ORNL) with 32,768 CPU cores and 4096 Nvidia K20x GPUs, the 3D algorithm achieves empirical speedups up × for to 24 planar graphs and 3.5× for non-planar graphs over the baseline 2D SuperLU with co-processor acceleration. The performance related parameters are: • superlu_num_lookaheads (SUPERLU_NUM_LOOKAHEADS), number of lookahead levels in the Schur-complement update (default is 10). In order to reduce the critical path of the sequence of panel factorizations, we softwar deviseedpip a elining method to overlap the panel factorization of the processes at � +step 1 with the Schur complement update of the other processes at step�. When there are multiple remaining supernodes in the Schur complement, the lookahead window (i.e., pipeline depth) can be greater14 than ]. This 1 [ environment variable deines the width of the lookahead window. • superlu_mpi_process_per_gpu (MPI_PROCESS_PER_GPU) (default is 1). The Halo algorithm uses GPU memory based on its availability. To do this correctly, the algorithm needs to know how many MPI processes are running on a GPU, which can be diicult to determine on some systems. This environment variable can be set to inform SuperLU_DIST that there are N ranks on each GPU so that it can limit its memory usage of each GPU to 90% of available memory shared among all MPI processes, which will, in turn, limit the amount of memory used by each rank. 3.3 2D SpTRSV GPU algorithm When the 2D grid has one MPI rank, SpTRSV in SuperLU_DIST is parallelized using OpenMP for shared-memory processors and CUDA or HIP for GPUs. Both versions of the implementations are based on an asynchronous level- set traversal algorithm that distributes the computation workload across CPU threads and GPU threads/blo 4]. cks [ The CPU implementation uses OpenMP taskloops and tasks for dynamic scheduling, while the GPU implementa- tion relies on static scheduling. Fig. 4a shows the performance of SpTRSV (L and U solves) on 1 Cori Haswell node with 1 and 8 OpenMP threads with a number of matrices. The speedup ranges between 1.4 and 4.3. Fig. 4b shows the performance of L-solve using SuperLU_DIST (8 ORNL Summit IBM POWER9 CPU cores or 1 Summit V100 GPU) and cuSPARSE (1 Summit V100 GPU). The GPU SpTRSV SuperLU_DIST in consistently outperforms cuSPARSE and is comparable to the 8-core CPU results. Here we choose 8 CPU cores as there are on average 7 CPU cores per GPU on Summit, and 8 is the closest power of 2 number. Note that GPU performance of the U-solve requires major improvements and is not available in the current release. That said, we compare the performance of SpTRSV (both L and U solves) on one Summit node using three conigurations: 1. (baseline) 1-core L solve and 1-core U solve, 2. (GPU) 1-GPU L solve and 1-core U solve, and 3. (GPU+OpenMP) 1-GPU L solve and 8-core U solve. The speedups comparing to the baseline coniguration are shown in Table 1. When the 2D grid has more than 1 MPI rank, SpTRSV also supports OpenMP parallelism with less speedups. In addition, the multi-GPU SpTRSV SuperLU_DIST in is under active development and will be available in future releases. The number of OpenMP threads can be controlled by the environment variable OMP_NUM_THREADS, and the GPU SpTRSV can be turned on with the -DGPU_SOLVE compiler lag. The user needs to make sure that only 1 MPI rank is used for the 2D grid when GPU SpTRSV is employed. 4 MIXED-PRECISION ROUTINES SuperLU_DIST has long supported four distinct loating-point types: IEEE FP32 real and complex, IEEE FP64 real and complex. Furthermore, the library allows all four datatypes to be used together in the same application. ACM Trans. Math. Softw. 8 • Xiaoye S. Li, Paul Lin, Yang Liu, and Piyush Sao (a) L and U solve (in Mflops) with 1 and 8 OpenMP threads on Cori Haswell (b) L solve ofSuperLU and cuSparse on CPU and GPU 2.5 CUSPARSE 1 OpenMP thread 8 OpenMP threads GPU SuperLU CPU(8-core) SuperLU 1.5 0.5 Fig. 4. Performance of SpTRSV with 1 MPI rank for a variety of sparse matrices. copter2 epb3 gridgenavanbody shipsec1dawson5 GPU vs. Baseline 1.6 1.7 1.6 1.6 1.54 1.6 GPU+OpenMP vs. Baseline 5.3 5.7 5.3 4.4 4.1 5.2 Table 1. Speedup of GPU SpTRSV compared with sequential CPU SpTRSV. Recent hardware trends have motivated increased developmentmixe of d-precision numerical libraries, mainly because hardware vendors have started designing special-purpose units for low precision arithmetic with higher speed. For direct linear solvers, a well understood method is to use lower precision to perform factorization (expensive) and higher precision to perform iterative reinement (IR) to recover accuracy (cheap). For a typical sparse matrix resulting from the 3D inite diference discretization of a regular mesh, theOSpLU � lops needs 4/3 while each IR step needs only O � lops (including SpTRSV and SpMV). For dense LU and QR factorizations, the beneit of lower precision format comes mainly from accelerated GEMM speed. But in the sparse case, the dimensions of the GEMM are generally smaller and of non-uniform size throughout factorization. Therefore, the speed gain from GEMM alone is limited. In addition to GEMM, a nontrivial cost is the Scatter operation. From our tests, we found that the fraction of time in GEMM usually is less than 50%. Because of this, the TensorCore version of GEMM calls led to a less than 5% speedup for the whole SpLU. When comparing FP32 with the FP64 versions of SpLU, we observed about 50% speedup with the FP32 version. We recall the IR algorithm using three precisions in Algorithm 2, 3]. This1algorithm [ is already available as xGERFSX functions in LAPACK, where the input matrix is dense and so is LU. The following three precisions may be used: • � is the working precision; it is used for the input � anddata �, and output �. (�) 2 • � is the precision for the computed solution � . We require� ≤ � , possibly � ≤ � if necessary for � � � � componentwise convergence. (�) 2 • � is the precision for the residuals � . We usually hav�e≤ � , i.e., at least twice the working precision. � � ACM Trans. Math. Softw. globalmat118_1536 matrix05 copter2 epb3 gridgena vanbody shipsec1 dawson5 gas_sensor rajat16 Gflops matrix05 copter2 gas_sensor xenon2 shipsec1 xenon1 g7jac160 g7jac140sc mark3jac100sc ct20stif vanbody dawson5 gridgena epb3 torso2 torsion1 rajat16 hcircuit SuperLU_DIST Release version-8 • 9 (a) audikw_1 convergence history (b) Ga19As19H2 convergence history Fig. 5. Convergence history of Algorithm 1 when applied to two sparse linear systems. The vertical line in each plot −24 corresponds to the IR steps taken when our stopping criteria are satisfied. Here, the working precision � = is 2 . The blue −53 lines are all single. The red lines correspond to� = � = 2 , but factorization is in single. � � Algorithm 1 Three-precisions Iterative Reinement (IR) for Direct Linear Solvers (1) 1: Solve�� = � using the basic solution method (e.g., LU or QR) ⊲ (� ) 2: �= 1 3: repeat (�) (�) 4: � ← �− �� ⊲ (� ) (�+1) (�) 5: Solve� �� = � using the basic solution method ⊲ (� ) (�+1) (�) (�+1) 6: Update � ← � + �� ⊲ (� ) 7: �← �+ 1 (�) 8: until � is łaccurate enoughž (�) 9: return � and error bounds Algorithm 1 converges with small normwise (or componentwise) error and error bound if the normwise (or def componentwise) condition number�of does not exceed 1/(�� ), where � = ��� (���(�(�,:))). Moreover, � � this IR procedure can return to the user both normwise and componentwise reliable error bounds. The error analysis in [2] should carry through to the sparse cases. We implemented AlgorithmSuperLU_DIST 1 in , using two precisions in IR: −24 −53 • � = 2 (IEEE-754 single precision), � = � = 2 (IEEE-754 double precision) � � � In Figure 5, the left two plots show the convergence history of two systems, in both normwise forward and backward errors, � and � , respectively (deined below). We perform two experiments: one using single ��� ��� precision IR, the other using double precision IR. As can be seen, single precision IR does not �redu,ce much ��� while double precision ��delivers � close to� . The IR time is usually under 10% of the factorization time. ��� � Overall, the mixed-precision speed is still faster than using pure FP64, see Table 2. The 2D driver routine for this mixed-precision appr psgssvx_d2 oach is , where the suix "d2" denotes that the intermediate � vector and � vector internal to the IR routine are carried in double precision. ACM Trans. Math. Softw. 10 • Xiaoye S. Li, Paul Lin, Yang Liu, and Piyush Sao Table 2. Parallel solution time (seconds) (including SpLU and IR): purely double precision, purely single precision, and mixed precision (FP32 SpLU + FP64 IR). ORNL Summit using up to 8 nodes, each node uses 6 CPU Cores (C) and 6 GPUs (G). Matrix Precision6 C+G 24 C+G 48 C+G Matrix Precision6 C+G 24 C+G 48 C+G audikw_1 Double 65.9 21.1 18.9 Ga19As19H42 Double 310.9 62.4 34.3 Single 45.8 13.8 10.5 Single 258.1 48.2 25.8 Mixed 49.2 13.9 11.4 Mixed 262.8 48.8 26.1 The only diference from the one-precision routine psgssvx is the output arrayerr_bounds[] (error bounds). For each right-hand side, we return the following three quantities: (�+1) (�) (�) ∥�� ∥ ∞/∥� ∥ ∥� −�∥ ∞ ∞ • err_bounds[0]: normwise forward error bound: � = max ,�� ≈ ���� � 1−� ∥�∥ ��� ∞ (�+1) def ∥�� ∥ ∞ (�) where, � = max is the estimate of the convergence rate�of. max �≤� (�) ∥�� ∥ (�) −1 (�) � −� ∥� �� ∥ � ∞ � • err_bounds[1]: componentwise forward error bound: max ,�� ≈ max � � 1−�ˆ � ��� � (�+1) ∥��� ∥ ∞ −1 (�) where, � = ����(�), �ˆ = max is the estimate of the convergence rate�of� max �≤� (�) ∥��� ∥ (�) |�−�� | • err_bounds[2]: componentwise backward error: max (�) (|�||� |+|�|) 5 SUMMARY OF TUNING PARAMETERS AND TUNING RESULTS Throughout all phases of the solution process, a number of algorithm parameters can inluence the solver’s performance. These parameters can be modiied by the user. For each user-callable routine, the irst argument is usually an input options " " argument, which points to the structure containing a number of algorithm choices. These choices are determined at compile time. The second column in Table 3 lists the namedoptions ields in the argument. The fourth column lists all the possible values and their corresponding C’s enumerated constant names. The user should call the following routine to set up the default values. superlu_dist_options_t options; set_default_options_dist(&options); After setting the defaults, the user can modify each default, for example: options.RowPerm = LargeDiag_HWPM; For a subset of these parameters, the user can change them at runtime via environment variables. These parameters are listed in the third column in Table 3. At various places of the code, an environment inquiry function SRC/sp_ienv.c is called to retrieve the values of the environment variables. Two algorithm blocking parameters can be changed at runtime: SUPERLU_MAXSUP and SUPERLU_RELAX. SUPERLU_MAXSUP sets the maximum size of a supernode. That is, if the number of columns in a supernode exceeds this value, we will split this supernode into two supernodes. Setting this parameter to a large value results in larger blocks and generally better performance for threaded and GPU GEMM. Increasing it limits the number of available parallel tasks across MPI processes. Figure 6a illustrates how performance, as measured in Glops, varies with SUPERLU_MAXSUP on a single node of Cori Haswell when using 32 OpenMP threads. For smaller matrices, such as this one (torso3), performance is near its peak when SUPERLU_MAXSUP equals 128, which is over×50 faster than when this value is set to 4. However, above this value, the performance starts to taper of. SUPERLU_RELAX is a relaxation parameter: if the number of nodes (columns) in a subtree of the elimination tree is less than this value, this subtree is treated as one supernode, regardless of the row structures. That means, we pad explicit zeros to enforce that all the columns within this relaxed supernode have the same row structure. The advantage of this padding is to mitigate many small supernodes at the bottom of the elimination tree. On ACM Trans. Math. Softw. SuperLU_DIST Release version-8 • 11 (a) (b) torso3-OMP_NUM_THREADS=32,Intel-Haswell Mem(L + U) Mem(IDX) Factorization Time 2 108 102 11 100 100 2 60 2 8 2 4 8 16 32 64 128 256 512 4 8 16 32 64 128 256 Max Supernode Size SUPERLU_RELAX Fig. 6. Impact of maximum supernode size (SUPERLU_MAXSUP) and supernodal relaxation (SUPERLU_RELAX) on perfor- mance and memory. The machine is NERSC Cori Haswell node. The matrix is torso3 from SuiteSparse. the other hand, a large value SUPERLU_RELAX of may introduce too many zeros which in turn propagate to the ancestors of the elimination tree, resulting in a large number of ill-ins in the L and U factors. Figure 6b shows the impact of this parameter on the memory use (left axis) and factorization time. A value of 32 or 64 represents a good tradeof between memory and time. The optimal settings of these parameters are matrix-dependent and hardware-dependent. Additionally, several other parameters and environment variables listed in Table 3 are performance critical for the 2D and 3D, CPU and GPU algorithms described in Sections 2, 3.1 and 3.2. It is a daunting task for manual tuning to ind the optimal setting of these parameters. Now in Sections 5.1 - 5.3 we show how an autotuner can signiicantly simplify this task. Here we leverage an autotuner called GPTune 9] to [ tune the performance (time and memory) of SpLU. We consider two example matrices from the Suitesparse matrix collection, G3_circuit from circuit simulation and H2O from quantum chemistry simulation. For all the experiments, we consider a two-objective tuning scenario and generate a Pareto front from the samples demonstrating the tradeof between memory and CPU requirement of SpLU. 5.1 3D CPU SpLU parameter tuning For the 3D CPU SpLU algorithm (2), we use 16 NERSC Cori Haswell nodes and the G3_circuit matrix. The number of OpenMP threads is set to 1, so there are a total�of� � = 512 MPI ranks. We consider the following tuning � � � parameters [ SUPERLU_MAXSUP, SUPERLU_RELAX, num_lookaheads,� , � ]. We set up GPTune to generate 100 � � samples. All samples and the Pareto front are plotted in Fig. 7a. The samples on the Pareto front and the default one are shown in Table 4, one can clearly see that by reducing the computation granularity SUPERLU_MAXSUP ( , SUPERLU_RELAX) and increasing � , one can signiicantly improve the SpLU time while using slightly more memory. 5.2 2D GPU SpLU parameter tuning For the 2D GPU SpLU algorithm (3.1), we use 2 NERSC Cray EX Perlmutter GPU compute nodes with 4 MPI ranks per node and the H2O matrix. Perlmutter GPU compute nodes consist of a single 64-core 2.45 GHz AMD EPYC 7763 CPU and four NVIDIA A100 (40GB HBM2) GPUs. The number of OpenMP threads is set to 16. We consider the following tuning parameters [ ColPerm, SUPERLU_MAXSUP, SUPERLU_RELAX, SUPERLU_N_GEMM , SUPERLU_MAX_BUFFER_SIZE, � ]. We set up GPTune to generate 100 samples. All samples and the Pareto front ACM Trans. Math. Softw. Performance in GFLOP/sec % Memory relative to SUPERLU_RELAX=4 Factorization time in seconds 12 • Xiaoye S. Li, Paul Lin, Yang Liu, and Piyush Sao Table 3. List of algorithm parameters used in various phases of the linear solver. The third column lists the environment variables that can be reset at runtime. parameters must be set inoptions{} the structure input to a driver routine. phase options env variables values in 2D or 3D algo. (compile-time) (runtime) (enum constants) Pre- Equil NO, YES (default) 2d, 3d process RowPerm 0: NOROWPERM 2d, 3d 1: LargeDiag_MC64 (default) 2d, 3d 2: LargeDiag_HWPM 2d, 3d 3: MY_PERMR 2d, 3d ColPerm 0: NATURAL 2d, 3d 1: MMD_ATA 2d, 3d 2: MMD_AT_PLUS_A 2d, 3d 3: COLAMD 2d, 3d 4: METIS_AT_PLUS_A (default) 2d, 3d 5: PARMETIS 2d, 3d 6: ZOLTAN 2d, 3d 7: MY_PERMC 2d, 3d ParSymbFact YES, NO (default) 2d, 3d SpLU ReplaceTinyPivot YES, NO (default) 2d, 3d Algo3d YES, NO (default) 3d DiagInv YES, NO (default) 2d num_lookaheads SUPERLU_NUM_LOOKAHEADS default 10 2d, 3d (Section 3.2) superlu_maxsup SUPERLU_MAXSUP default 256 2d, 3d (Section 5) superlu_relax SUPERLU_RELAX default 60 2d, 3d superlu_rankorder SUPERLU_RANKORDER default Z-major 3d (Section 2.1) superlu_lbs SUPERLU_LBS default GD 3d (Section 2.1) superlu_acc_oload SUPERLU_ACC_OFFLOAD 0, 1 (default) 2d, 3d (Section 3) superlu_n_gemm SUPERLU_N_GEMM default 5000 2d (Section 3.1) superlu_max_bufer_size SUPERLU_MAX_BUFFER_SIZE default 250M words 2d, 3d (Section 3.1) superlu_num_gpu_streams SUPERLU_NUM_GPU_STREAMS default 8 2d (Section 3.1) superlu_mpi_process_per_gpuSUPERLU_MPI_PROCESS_PER_GPU default 1 3d (Section 3.2) OMP_NUM_THREADS default system dependent 2d, 3d (Section 5.4) OMP_PLACES default system dependent 2d, 3d OMP_PROC_BIND default system dependent 2d, 3d OMP_NESTED default system dependent 2d, 3d OMP_DYNAMIC default system dependent 2d, 3d SpTRSV IterReine 0: NOREFINE (default) 2d, 3d (Section 4) 1: SLU_SINGLE 2: SLU_DOUBLE Others PrintStat NO, YES (default) 2d, 3d are plotted in Fig. 7b. The samples on the Pareto front and the default one are shown in Table 5. Compared to the default coniguration, both the time and memory can be signiicantly improved by increasing the computation granularity (larger SUPERLU_MAXSUP, SUPERLU_RELAX). Also, less GPU oload (larger SUPERLU_N_GEMM) leads to better performance. ACM Trans. Math. Softw. SuperLU_DIST Release version-8 • 13 (a) 3D CPU SpLU (b) 2D GPU SpLU (c) 3D GPU SpLU 4000 4 All samples 10 Pareto optima Default 10000 1.5 All samples Pareto optima All samples Default Pareto optima 4000 Default 0 1 2 3 10 10 10 10 15 20 40 80 10 20 40 Fig. 7. Samples generated by GPTune for the three tuning experiments. Only valid samples are ploted. 5.3 3D GPU SpLU parameter tuning For the 3D GPU SpLU algorithm in Section 3.2, we use 2 NERSC Perlmutter GPU nodes with 4 MPI ranks per node and the H2O matrix. The number of OpenMP threads is set to 16, and � � � = 8. We consider the following � � � tuning parameters [ ColPerm, SUPERLU_MAXSUP, SUPERLU_RELAX, SUPERLU_MAX_BUFFER_SIZE, � , � ]. We set � � up GPTune to generate 200 samples. All samples and the Pareto front are plotted in Fig. 7c. The samples on the Pareto front and the default one are shown in Table 6. Compared to the default coniguration, both the time and memory utilization can be signiicantly improved by increasing the computation granularity and decreasing GPU bufer sizes. ColPerm=‘4’ (METIS_AT_PLUS_A) is always preferable in terms of memory usage. The efe � cts of and � are insigniicant as only 8 MPI ranks are used. SUPERLU_MAXSUP SUPERLU_RELAX num_lookaheads� � Time (s)Memory (MB) � � Default 256 60 10 16 1 5.6 2290 Tuned 31 25 17 16 1 21.9 2253 Tuned 53 35 7 4 4 1.64 2360 Table 4. Default and optimal samples returned by GPTune for the 3D CPU SpLU algorithm. Note �that is derived by � = 512/(� � ), as the total MPI count is fixed at 512. � � � ColPermSUPERLU_MAXSUP SUPERLU_RELAX SUPERLU_N_GEMM SUPERLU_MAX_BUFFER_SIZE � TimeMemory (s) (MB) Default ‘4’ 256 60 1000 2.5E8 4 20.8 6393 Tuned ‘4’ 154 154 2048 2.68E8 2 13.5 6011 Tuned ‘4’ 345 198 262144 6.7E7 2 13.2 6813 Tuned ‘4’ 124 110 8192 1.3E8 2 14.6 5976 Table 5. Default and optimal samples returned by GPTune for the 2D GPU SpLU algorithm. Note �that is derived by � = 8/� , as the total MPI count is fixed at 8. � � ACM Trans. Math. Softw. 14 • Xiaoye S. Li, Paul Lin, Yang Liu, and Piyush Sao ColPermSUPERLU_MAXSUP SUPERLU_RELAX SUPERLU_MAX_BUFFER_SIZE � � Time (s)Memory (MB) � � Default ‘4’ 256 60 2.5E8 4 1 25.3 3520 Tuned ‘4’ 176 143 1.34E8 2 1 12.1 3360 Tuned ‘4’ 327 182 1.34E8 4 2 7.4 3752 Tuned ‘4’ 610 200 3.34E7 8 1 12.5 3280 Tuned ‘4’ 404 187 3.34E7 1 2 8.76 3744 Tuned ‘4’ 232 199 3.34E7 4 2 6.7 3936 Table 6. Default and optimal samples returned by GPTune for the 3D GPU SpLU algorithm. Note�that is calculated from � and � as the total MPI count is fixed at 8. � � 5.4 Tuning of OpenMP Intra-node Parallelism SuperLU_DIST can use shared-memory parallelism on CPUs in two ways. The irst is by using the multithreaded BLAS library for linear-algebraic operations. This is independent of the implementation SuperLU_DIST itself. of The second is throughSuperLU_DIST’s direct use of OpenMP pragmas for explicit parallelization of some computations. OpenMP is portable across a wide variety of CPU architectures and operating systems. OpenMP ofers a shared- memory programming model, which can be easier to use than a message-passing programming model. In this section, we discuss the advantages and limitations of using OpenMP, and ofer some performance considerations. Advantage of OpenMP Parallelism:We have empirically observed that hybrid programming with MPI+OpenMP often requires less memory than pure MPI. This is because OpenMP does not require additional memory for message passing bufers. In most cases, correctly tuned hybrid programming with MPI+OpenMP provides better performance than pure MPI. Limitations of OpenMP Parallelism: Beneits of OpenMP parallelism is often less predictable than pure MPI parallelism because of non-determinism in the threading layer, CPU hardware, and thread ainities. Finding the right coniguration for OpenMP may take some trials and errors because performance depends on many factors: CPU architecture, number of cores and threads, the threading library being used, and the operating system. OpenMP threading may cause a signiicant slowdown if parameters are chosen incorrectly. Slowdown can be due to false-sharing, NUMA efects, hyperthreading, incorrect or suboptimal thread ainities, or underlying threading libraries. OpenMP performance tuning:To get the best performance, we recommend tuning the following OpenMP variables environment variables. OMP_NUM_THREADS governs the number of OpenMP threads thatSuperLU_DIST can use. To avoid resource over-subscription, the product of MPI processes per node and OpenMP threads should be less than or equal to available physical OMP_PLACES cores. deines where the threads may run. Possible values are cores, threads, or socket. OMP_PROC_BIND controls the thread migration. When OMP_PROC_BIND the directive is set to TRUE, OpenMP threads should not be moved; when FALSE they may move between hardware cores and sockets. In general, whenOMP_PLACES is set, the setting OMP_PROC_BIND of should be set to TRUE. The other two less commonly used OpenMP environment variables OMP_NESTED are Ðcontrols levels of nested parallelism,OMP_DYNAMIC and Ðdetermines whether to change the number of threads or thread groups dynamically. Both variables are set to False by default, which works well in most systems. For performance debugging purposes, however, we can explicitly set the two variables. In Fig. 8, we show the impact of diferent OpenMP variables and hybrid MPI-OpenMP conigurations on the SpLU speed on Cori Haswell nodes at NERSC. Figure 8a shows the best performance achieved for diferent OpenMP and NUMA settings variables for purely threaded conigurations. Figure 8b shows the performance for diferent MPI×OpenMP threads on four Haswell nodes of Cori. It should be noted that, hybrid conigurations, i.e. ACM Trans. Math. Softw. SuperLU_DIST Release version-8 • 15 conigurations with more than one OpenMP threads per MPI process, tends to require far less memory for MPI’s internal bufers 11[]. In general, using 2-8 OpenMP threads per MPI process gives good performance across a wide range of matrices. (b) Performance w.r.t MPI× OMP threads on 4 (a) Best performance for diferent OMP variables nodes 2.1 70.9 70.9 70.9 70.970.7 70.9 128 × 1 16 × 8 70 67.9 66.9 2.00 64 × 2 8 × 16 64.6 32 × 4 4 × 32 1.75 56.3 53.9 53.9 1.6 50 1.50 1.3 1.3 1.25 1.2 1.0 1.0 1.0 1.0 1.00 30 0.9 0.9 0.9 0.75 0.50 0.25 NumaAlloc OMP_NUM_THREADS OMP_PLACES OMP_PROC_BIND OMP_DYNAMIC 0.00 nd24k torso3 matrix Fig. 8. OpenMP performance tuning for SpLU on Cori Haswell nodes at NERSC. In Fig. 8a, each bar shows the best performance obtained for a variableśvalue pair by an exhaustive parametric search for the other four variables; the test matrix is torso3 and number of threads is 32. The OpenMP API allows control of these variables programmatically. This becomes useful when the application and SuperLU require diferent OpenMP conigurations. For best performance, the user can use our autotuner GPTune to tune these variables automatically, see Section 5. 6 CONCLUSION In this paper, we presented the recently added features in the distributed-memory sparse direSuperLU_DIST ct solver . They represent signiicant algorithmic advances, including (1) communication-avoiding 3D sparse LU factor- ization, (2) support of multi-GPU oloading, and (3) mixed-precision computations for higher speed and lower memory consumption. Incorporating these algorithmic changes required solving challenging software engineer- ing problems while bringing the research prototype code to the production code that is usable by the users. Furthermore, we documented the parameters that may impact the solver’s performance. The parameters we discussed include those inSuperLU_DIST the library as well as some system’s parameters related to OpenMP. Since the sparse solvers performance are sensitive to both sparse matrix structures and the underlying computer architectures, we show that an autotuner framework, such as GPTune, provides a powerful tool to help choose the best parameters setting. We have plans to incorporate several recent fruitful research results into the future releases of the software. These include (1) communication-avoiding 3D Sp13 TRSV ] and[(2) communication-hiding SpTRSV via one-sided MPI and NVSHMEM direct GPU-to-GPU communication [4, 5]. ACKNOWLEDGMENTS We are grateful to Barry Smith for building the PETSc interface for the new 3D code, for the suggestions to improve the interface of how the parameters should be exposed to the users, and for the detailed feedback of the initial manuscript. This research was supported in part by the Exascale Computing Project (17-SC-20-SC), a collaborative efort of the U.S. Department of Energy Oice of Science and the National Nuclear Security Administration, and in part ACM Trans. Math. Softw. GFLOP/sec Interleave Default 32.0 64.0 cores sockets threads false spread close false true RelativeTime 16 • Xiaoye S. Li, Paul Lin, Yang Liu, and Piyush Sao by the Scientiic Discovery through Advanced Computing (SciDAC) Program under the Oice of Science at the U.S. Department of Energy. REFERENCES [1] A. Azad, A. Buluc, X.S. Li, X. Wang, and J. Langguth. 2020. A Distributed-Memory Algorithm for Computing a Heavy-Weight Perfect Matching on Bipartite Graphs. SIAM J. Scientiic Computing 42, 4 (2020), C143śC168. [2] J. Demmel, Y. Hida, W. Kahan, X.S. Li, S. Mukherjee, and E.J. Riedy. 2006. Error Bounds from Extra-Precise Iterative Reinement. ACM Trans. Math. Softw. 32, 2 (June 2006), 325ś351. [3] J. Demmel, Y. Hida, E.J. Riedy, and X.S. Li. 2009. Extra-precise iterative reinement for overdetermined least squarA esCM problems. Transactions on Mathematical Software (TOMS)35, 4 (2009), 28. [4] Nan Ding, Yang Liu, Samuel Williams, and Xiaoye S.ALi. Message-Driv [n. d.]. en, Multi-GPU Parallel Sparse Triangular Solv.er147ś159. https://doi.org/10.1137/1.9781611976830.14 arXiv:https://epubs.siam.org/doi/pdf/10.1137/1.9781611976830.14 [5] Nan Ding, Samuel Williams, Yang Liu, and Xiaoye S. Le Li. veraging [n. d.]. One-Sided Communication for Sparse Triangular Solv.ers 93ś105. https://doi.org/10.1137/1.9781611976137.9 arXiv:https://epubs.siam.org/doi/pdf/10.1137/1.9781611976137.9 [6] X.S. Li, J.W. Demmel, J.R. Gilbert, L. Grigori, P. Sao, M. Shao, and I. Yamazaki. Sup1999. erLU Users’ Guide. Technical Report LBNL-44289. Lawrence Berkeley National Laboratory. https://portal.nersc.gov/project/sparse/superlu/ug.pdf, Last update: June 2018. [7] X. S. Li and J. W. Demmel. 1998. Making Sparse Gaussian Elimination Scalable by Static ProPiv ceedings oting. of SC98: In High Performance Networking and Computing Conference . Orlando, Florida. [8] X. S. Li and J. W. Demmel. 2003. SuperLU_DIST: A Scalable Distributed-Memory Sparse Direct Solver for Unsymmetric Linear Systems. ACM Trans. Mathematical Software29, 2 (June 2003), 110ś140. [9] Yang Liu, Wissam M. Sid-Lakhdar, Osni Marques, Xinran Zhu, Chang Meng, James W. Demmel, and Xiaoye S. Li. 2021. GPTune: Multitask Learning for Autotuning Exascale Applications. Proceedings In of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming(Virtual Event, Republic of K(PPoPP orea) ’21). Association for Computing Machinery, New York, NY, USA, 234ś246. https://doi.org/10.1145/3437801.3441621 [10] P. Sao, X. Liu, R. Vuduc, and X.S. Li. 2015. A Sparse Direct Solver for Distributed Memory Xeon Phi-accelerated Systems. 29th IEEEIn International Parallel & Distributed Processing Symposium (IPDPS) . Hyderabad, India. [11] P. Sao, R. Vuduc, and X. Li. 2014. A Distributed CPU-GPU Sparse Direct Solv Prero.c.Inof Euro-Par 2014, LNCS Vol. 8632, pp. 487-498. Porto, Portugal. [12] P. Sao, R. Vuduc, and X. Li. 2019. A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneJ. ous systems. Parallel and Distributed Computing (September 2019). https://doi.org/10.1016/j.jpdc.2019.03.004 https://www.sciencedirect.com/science/ article/abs/pii/S0743731518305197. [13] P. Sao, R. Vuduc, and X.S. Li. 2019. A Communication-Avoiding 3D Sparse Triangular Solv ICS 2019: er. InInternational Conference on Supercomputing. Phoenix, AZ. [14] I. Yamazaki and X.S. Li. 2012. New Scheduling Strategies and Hybrid Programming for a Parallel Right-looking Sparse LU Factorization on Multicore Cluster Systems.PrIn oceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS 2012) . Shanghai, China. ACM Trans. Math. Softw. SuperLU_DIST Release version-8 • 17 APPENDIX A NAMING CONVENTION AND CODE DOCUMENTATION The routines in SuperLU_DIST are divided into driver routinesand computational routines . The routine names are inspired by the LAPACK and ScaLAPACK naming convention. For example, the 2D linear solver driver is pdgssvx, where ‘p’ means parallel, ‘d’ means double precision, ‘gs’ means general sparse matrix format, and ‘svx’ means solving a linear system. Below is a list of double precision user-callable routines. • Driver routines: pdgssvx (driver for the old 2D algorithms), pdgssvx3d (driver for the new 3D algorithms in Section 2). • Computational routines: pdgstrf and pdgstrs are respectively triangular factorization SpLU and triangular solve in the 2D process grid. pdgstrf3d is triangular factorization SpLU in the 3D process grid. These routines take apreprocessed linear system as an input. An experienced user can use them directly in the application code as they can provide greater lexibility. For a new user, however, using them can be cumbersome and error-prone. We recommend using driver routines, which are easier to use. • The pddrive and pddrive3d examples in the EXAMPLE/ directory call the respective driv pdgssvx ers and pdgssvx3d to solve linear systems. Other examples in the same directory,pddrive1 such as , pddrive2, etc., illustrate how to reuse the preprocessing results for a sequence of linear systems with similar structures. The Doxygen generated documentation for all the routines is available at https://portal.nersc.gov/project/sparse/ superlu/superlu_dist_code_html/. Each routine begins with a comment that breaks down input/output arguments and explains the functions of the routine. Although the original User’s Guide contains comprehensive description of various internal data structures and algorithms [6], it does not contain the new features presented here. B FORTRAN 90 INTERFACE In the FORTRAN/ directory, there are Fortran 90 module iles that implement the wrappers for Fortran programs to access the full functionality of the C functions SuperLU_DIST in . The design is based on object-oriented programming concept: deine opaque objects in the C space, which are accessed via handles from the Fortran space. All SuperLU_DIST objects (e.g., process grid, LU structure) are opaque from the Fortran side. They are allocated, deallocated and operated at the C side. For each C object, we deine a Fortran handle in Fortran’s user space, which points to the C object and implements the access methods to manipulate the object. All handles are 64-bit integer type. For example, consider creating a 3D process grid. The following code snippet shows what are involved from the Fortran and C sides. • Fortran side /* Declare handle: */ integer(64)::f_grid3d /* Call C wrapper routine to create 3D grid pointed to by "f_grid3d": */ call f_superlu_gridinit3d(MPI_COMM_WORLD, nprow, npcol, npdep, f_grid3d) • C side /* Fortran-to-C interface routine: */ void f_superlu_gridinit3d(int *MPIcomm, int *nprow, int *npcol,int *npdep, int64_t *f_grid3d) /* Actual call to C routine to create grid3d structture in *grid3d{} */ superlu_gridinit3d(f2c_comm(MPIcomm),*nprow, *npcol, *npdep, (gridinfo3d_t *) *f_grid3d); We support four datatypes: ‘s’ (FP32 real), ‘d’ (FP64 double), ‘c’ (FP32 complex) and ‘z’ (FP64 complex). Throughout the paper, we use the ‘d’ version of the routine names. ACM Trans. Math. Softw. 18 • Xiaoye S. Li, Paul Lin, Yang Liu, and Piyush Sao Here, the Fortran handlef_grid3d essentially acts as a 64-bit pointer pointing to the internal 3D grid structure, which is created by the C routine superlu_gridinit3d(). This structure (see Fig. 2) sits in the C space and is invisible from the Fortran side. For all the user-callable C functions, we provide the corresponding Fortran-to-C interface functions, so that the Fortran program can access all the C functionality. These interface routines are implemented in the iles superlu_c2f_wrap.c (precision-independent) and superlu_c2f_dwrap.c (double precision). The Fortran-to- C name mangling is handled by CMake through the headerSRC/superlu_FCnames.h ile . The module ile superlupara.f90 deines all the constants matching the enum constants deined in the C side (see Table 3). The module ile superlu_mod.f90 implements all the access methods (set/get) for the Fortran side to access the objects created in the C user space. C INSTALLATION WITH CMAKE OR SPACK C.1 Dependent external libraries One can have a bare minimum installation SuperLU_DIST of without any external dependencies, although the following external libraries are useful for high performance: BLAS, (Par)METIS (sparsity-preserving ordering), CombBLAS (parallel numerical pivoting) and LAPACK (for inversion of dense diagonal block). C.2 CMake installation The user will need to create a build tree from which to invoke CMake. The following describes how to deine the external libraries. BLAS (highly recommended) If the user has a fast BLAS library on their machine, the user can link it using the following CMake deinition: -DTPL_BLAS_LIBRARIES="<BLAS library name>" Otherwise, theCBLAS/ subdirectory contains the part of the C BLAS (single threaded) neede SuperLU_DIST d by , but it is not optimized for performance. The user can compile and use this internal BLAS with the following CMake deinition: -DTPL_ENABLE_INTERNAL_BLASLIB=ON ParMETIS (highly recommended) http://glaros.dtc.umn.edu/gkhome/fetch/sw/parmetis/parmetis-4.0.3.tar.gz The user can install ParMETIS and deine the two environment variables as follows: export PARMETIS_ROOT=<Prefix directory of the ParMETIS installation> export PARMETIS_BUILD_DIR=${PARMETIS_ROOT}/build/Linux-x86_64 Note that by default, we use serial METIS as the sparsity-preserving ordering, which is available in the ParMETIS package. The user can disable ParMETIS during installation with the following CMake deinition: -DTPL_ENABLE_PARMETISLIB=OFF. In this case, the default ordering is setMMD_AT_PLUS_A to be . See Table 3 for all the possible ColPerm options. In order to use parallel symbolic factorization function, the user needs to use ParMETIS ordering. LAPACK (highly recommended) In the triangular solve routine, we may use LAPACK to explicitly invert the dense diagonal block to improve the performance. The user can use it with the following CMake option: ACM Trans. Math. Softw. SuperLU_DIST Release version-8 • 19 -DTPL_ENABLE_LAPACKLIB=ON CombBLAS (optional) https://people.eecs.berkeley.edu/~aydin/CombBLAS/html/index.html In order to use parallel weighted matching HWPM (Heavy Weight Perfect Matching) for numerical pre- pivoting [1], the user needs to install CombBLAS and deine the environment variables: export COMBBLAS_ROOT=<Prefix directory of the CombBLAS installation> export COMBBLAS_BUILD_DIR=${COMBBLAS_ROOT}/_build Then, install with the CMake option: -DTPL_ENABLE_COMBBLASLIB=ON Use GPU The user can enable (NVIDIA) GPU with CUDA with the following CMake option: -DTPL_ENABLE_CUDALIB=TRUE The user can enable (AMD) GPU with HIP with the following CMake option: -DTPL_ENABLE_HIPLIB=TRUE For a simple installation with default settings: mkdir build ; cd build; cmake .. \ -DTPL_PARMETIS_INCLUDE_DIRS="${PARMETIS_ROOT}/include;\ ${PARMETIS_ROOT}/metis/include" \ -DTPL_PARMETIS_LIBRARIES="${PARMETIS_BUILD_DIR}/libparmetis/libparmetis.a;\ ${PARMETIS_BUILD_DIR}/libmetis/libmetis.a" \ There are a number of example build scripts example_script/ in the directory, with ilenames run_cmake_build_*.sh that target various machines. To actually build (compile), type: ‘make’. To install the libraries, type: ‘make install’. To run the installation tests, type: ‘test’. (The outputs are in ile: ‘build/Testing/Temporary/LastTest.log’) or, ‘ctest -D Experimental’, or, ‘ctest -D Nightly’. Note that the parallel execution ctest in is invoked by the "mpiexec" command, which is from the MPICH environment. If the MPI is not MPICH/mpiexec based, the test execution may fail. The user can pass the deinition option-DMPIEXEC_EXECUTABLE to CMake. For example on Cori at NERSC, the user will need the following: cmake .. -DMPIEXEC_EXECUTABLE=/usr/bin/srun. Or, the user can always go toTEST/ directory to perform testing manually. The following list summarizes the commonly used CMake deinitions. In each case, the irst choice is the default setting. After running a ‘cmake’ installation, a coniguration header ile isSRC/superlu_dist_config.h generated in , which contains the key CPP deinitions used throughout the code. -DTPL_ENABLE_INTERNAL_BLASLIB=OFF | ON -DTPL_ENABLE_PARMETISLIB=ON | OFF -DTPL_ENABLE_LAPACKLIB=OFF | ON -DTPL_ENABLE_COMBBLASLIB=OFF | ON -DTPL_ENABLE_CUDALIB=OFF | ON -DCMAKE_CUDA_FLAGS=<...> -DTPL_ENABLE_HIPLIB=OFF | ON -DHIP_HIPCC_FLAGS=<...> ACM Trans. Math. Softw. 20 • Xiaoye S. Li, Paul Lin, Yang Liu, and Piyush Sao -Denable_complex16=OFF | ON (double-complex datatype) -Denable_single=OFF | ON (single precision real datatype) -DXSDK_INDEX_SIZE=32 | 64 (integer size for indexing) -DBUILD_SHARED_LIBS= OFF | ON -DCMAKE_INSTALL_PREFIX=<...> -DCMAKE_C_COMPILER=<MPI C compiler> -DCMAKE_C_FLAGS=<...> -DCMAKE_CXX_COMPILER=<MPI C++ compiler> -DCMAKE_CXX_FLAGS=<...> -DXSDK_ENABLE_Fortran=OFF | ON -DCMAKE_Fortran_COMPILER=<MPI F90 compiler> C.3 Spack installation Spack installation SuperLU_DIST of is a fully automated process. Assume that the develop branch of Spack (https://github.com/spack/spack) is used. The user can ind available compilers spackvia: compilers. In the following, let’s assume the available compiler is gcc@9.1.0. The installation supports the following variants: Use 64-bit integer The user can enable 64-bit integer with: spack install superlu-dist@master+int64%gcc@9.1.0 Use GPU The user can enable (NVIDIA or AMD) GPUs with: spack install superlu-dist@master+cuda%gcc@9.1.0 spack install superlu-dist@master+rocm%gcc@9.1.0 Test installation The user can run a few smoke tests of the spack installation via spack test run superlu-dist@master (pick the appropriate installation if multiple variants available) ACM Trans. Math. Softw.

Journal

ACM Transactions on Mathematical Software (TOMS)Association for Computing Machinery

Published: Mar 21, 2023

Keywords: Sparse direct solver

References