

Tridigpu: A GPU Library for Block Tridiagonal and Banded Linear Equation Systems

tridigpu: A GPU library for block tridiagonal and banded linear equation systems

CHRISTOPH KLEIN, Institute of Computer Engineering (ZITI), Germany
ROBERT STRZODKA, Institute of Computer Engineering (ZITI), Germany

In this paper we present a CUDA library with a C API for solving block cyclic tridiagonal and banded systems on one GPU. The library can process block tridiagonal systems with block sizes from 1x1 (scalar) to 4x4 and banded systems with up to 4 sub- and superdiagonals. For the compute-intensive block size cases and cases with many right-hand sides, we write out an explicit factorization to memory; for the scalar case, however, the fastest approach is to output only the coarse system and recompute the factorization. Prominent features of the library are (scaled) partial pivoting for improved numeric stability; high-performance kernels, which completely utilize the GPU memory bandwidth; and support for multiple sparse or dense right-hand side and solution vectors. The additional memory consumption is only 5% of the original tridiagonal system, which enables the solution of systems up to GPU memory size. The state-of-the-art scalar tridiagonal solver of cuSPARSE is outperformed by a factor of 5 for large problem sizes on a GeForce RTX 2080 Ti.

CCS Concepts: • Mathematics of computing → Solvers; Mathematical software performance; • Computing methodologies → Massively parallel algorithms; Concurrent algorithms.

Additional Key Words and Phrases: tridiagonal solver, GPUs, CUDA, banded solver, gtsv, gbsv

1 INTRODUCTION

tridigpu is a library for the solution of scalar and block tridiagonal linear equation systems with multiple right-hand sides

$$
AX =
\begin{pmatrix}
b_0 & c_0 & & & & a_0 \\
a_1 & b_1 & c_1 & & & \\
& a_2 & b_2 & c_2 & & \\
& & \ddots & \ddots & \ddots & \\
& & & a_{\hat{N}-2} & b_{\hat{N}-2} & c_{\hat{N}-2} \\
c_{\hat{N}-1} & & & & a_{\hat{N}-1} & b_{\hat{N}-1}
\end{pmatrix}
\begin{pmatrix}
x_{0,0} & \cdots & x_{0,n_r-1} \\
x_{1,0} & \cdots & x_{1,n_r-1} \\
\vdots & & \vdots \\
x_{\hat{N}-1,0} & \cdots & x_{\hat{N}-1,n_r-1}
\end{pmatrix}
=
\begin{pmatrix}
d_{0,0} & \cdots & d_{0,n_r-1} \\
d_{1,0} & \cdots & d_{1,n_r-1} \\
\vdots & & \vdots \\
d_{\hat{N}-1,0} & \cdots & d_{\hat{N}-1,n_r-1}
\end{pmatrix}
= D, \quad (1)
$$

with $A \in \mathbb{F}^{N \times N}$, small coefficient blocks $a_i, b_i, c_i \in \mathbb{F}^{n \times n}$, $x_{i,j}, d_{i,j} \in \mathbb{F}^{n \times 1}$, and $i = 0, \ldots, \hat{N}-1$, where $\mathbb{F}$ is the field of real numbers $\mathbb{R}$ or complex numbers $\mathbb{C}$. Matrix $A$ has $\hat{N}$ block rows with block size $n \in \mathbb{N}$ and $N = \hat{N}n$ scalar rows. The number of right-hand sides of the equation system is denoted with $n_r \in \mathbb{N}$. Solving scalar tridiagonal systems ($n = 1$) is required in many applications, e.g. in electrodynamics [23], fluid dynamics [22, 26], or computer graphics [25], whereas the solution of block tridiagonal systems ($n > 1$) is required in computational finance [15], computational fluid dynamics [36], or image restoration problems [6]. Equation 1 encompasses multiple problem classes which tridigpu can solve efficiently. These problem classes guide our discussion of the C API, the algorithms, the implementations and the results throughout the paper.

Authors' addresses: Christoph Klein, christoph.klein@ziti.uni-heidelberg.de, Institute of Computer Engineering (ZITI), Im Neuenheimer Feld 368, Heidelberg, Germany, 69126; Robert Strzodka, robert.strzodka@ziti.uni-heidelberg.de, Institute of Computer Engineering (ZITI), Im Neuenheimer Feld 368, Heidelberg, Germany, 69126.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
© 2023 Copyright held by the owner/author(s).
2329-4949/2023/1-ART111
https://doi.org/10.1145/3580373

Problem Class Cyclic: cyclic tridiagonal systems with multiple right-hand sides ($a_0, c_{\hat{N}-1} \neq 0$),
Problem Class Scalar: scalar tridiagonal systems with multiple right-hand sides ($n = 1$),
Problem Class Block: block tridiagonal systems with multiple right-hand sides ($n > 1$),
Problem Class DIA: banded systems with multiple right-hand sides ($n > 1$),
Problem Class ScalarCSC2Dense: scalar tridiagonal systems with multiple sparse right-hand sides, $AX = D$, with a compressed sparse column encoded matrix $D$ (see CSC format in cuSPARSE [34]),
Problem Class ScalarCSC2CSC: scalar tridiagonal systems with many sparse right-hand sides and a pruned incomplete result, $AX = D$, with CSC encoded large sparse matrices $D$ and $X$.

In the following we list one small example for each problem class. Obviously, one example cannot express all possible parameter variations within the corresponding problem class, but only highlight a certain characteristic aspect.

Example of Problem Class Cyclic is a system with $a_0, c_{\hat{N}-1} \neq 0$, $n = 1$, $N = 6$, and $n_r = 1$:

$$
AX =
\begin{pmatrix}
7 & 13 & & & & 1 \\
2 & 8 & 14 & & & \\
& 3 & 9 & 15 & & \\
& & 4 & 10 & 16 & \\
& & & 5 & 11 & 17 \\
18 & & & & 6 & 12
\end{pmatrix}
\begin{pmatrix} x_{0,0} \\ x_{1,0} \\ x_{2,0} \\ x_{3,0} \\ x_{4,0} \\ x_{5,0} \end{pmatrix}
=
\begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \end{pmatrix}
= D. \quad (2)
$$

Example of Problem Class Scalar is a system with $n = 1$, $N = 6$, and $n_r = 3$:

$$
AX =
\begin{pmatrix}
6 & 12 & & & & \\
1 & 7 & 13 & & & \\
& 2 & 8 & 14 & & \\
& & 3 & 9 & 15 & \\
& & & 4 & 10 & 16 \\
& & & & 5 & 11
\end{pmatrix}
\begin{pmatrix}
x_{0,0} & x_{0,1} & x_{0,2} \\
x_{1,0} & x_{1,1} & x_{1,2} \\
x_{2,0} & x_{2,1} & x_{2,2} \\
x_{3,0} & x_{3,1} & x_{3,2} \\
x_{4,0} & x_{4,1} & x_{4,2} \\
x_{5,0} & x_{5,1} & x_{5,2}
\end{pmatrix}
=
\begin{pmatrix}
1 & 7 & 13 \\
2 & 8 & 14 \\
3 & 9 & 15 \\
4 & 10 & 16 \\
5 & 11 & 17 \\
6 & 12 & 18
\end{pmatrix}
= D. \quad (3)
$$

Example of Problem Class Block is a system with $n = 2$, $N = 8$, $\hat{N} = 4$, and $n_r = 1$:

$$
AX =
\begin{pmatrix}
13 & 15 & 29 & 31 & & & & \\
14 & 16 & 30 & 32 & & & & \\
1 & 3 & 17 & 19 & 33 & 35 & & \\
2 & 4 & 18 & 20 & 34 & 36 & & \\
& & 5 & 7 & 21 & 23 & 37 & 39 \\
& & 6 & 8 & 22 & 24 & 38 & 40 \\
& & & & 9 & 11 & 25 & 27 \\
& & & & 10 & 12 & 26 & 28
\end{pmatrix}
\begin{pmatrix} x_{0,0} \\ x_{1,0} \\ x_{2,0} \\ x_{3,0} \end{pmatrix}
=
\begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \\ 7 \\ 8 \end{pmatrix}
= D. \quad (4)
$$

Example of Problem Class DIA is a pentadiagonal system with $n = 2$, $N = 8$, $\hat{N} = 4$, and $n_r = 1$:

$$
AX =
\begin{pmatrix}
13 & 15 & 29 & & & & & \\
14 & 16 & 30 & 32 & & & & \\
1 & 3 & 17 & 19 & 33 & & & \\
& 4 & 18 & 20 & 34 & 36 & & \\
& & 5 & 7 & 21 & 23 & 37 & \\
& & & 8 & 22 & 24 & 38 & 40 \\
& & & & 9 & 11 & 25 & 27 \\
& & & & & 12 & 26 & 28
\end{pmatrix}
\begin{pmatrix} x_{0,0} \\ x_{1,0} \\ x_{2,0} \\ x_{3,0} \end{pmatrix}
=
\begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \\ 7 \\ 8 \end{pmatrix}
= D. \quad (5)
$$

Example of Problem Class ScalarCSC2Dense is a system with $n = 1$, $N = 6$, $\hat{N} = 6$, $n_r = 6$, a sparse matrix $D \in \mathbb{R}^{6 \times 6}$ and a dense solution:

$$
AX =
\begin{pmatrix}
6 & 12 & & & & \\
1 & 7 & 13 & & & \\
& 2 & 8 & 14 & & \\
& & 3 & 9 & 15 & \\
& & & 4 & 10 & 16 \\
& & & & 5 & 11
\end{pmatrix}
\begin{pmatrix}
x_{0,0} & \cdots & x_{0,5} \\
\vdots & & \vdots \\
x_{5,0} & \cdots & x_{5,5}
\end{pmatrix}
= D, \quad (6)
$$

where $D \in \mathbb{R}^{6 \times 6}$ is a sparse right-hand-side matrix with 13 nonzero entries.
The sparse matrix $D$ contains the right-hand sides and we calculate $X = A^{-1}D$. For large $A$, this operation is only possible if $D$ has few columns, otherwise the dense $X$ becomes too big to store.

Example of Problem Class ScalarCSC2CSC is a system with $n = 1$, $N = 6$, $\hat{N} = 6$, $n_r = 6$, and sparse matrices $D, X \in \mathbb{R}^{6 \times 6}$; the system is inverted exactly, but the output is written in a sparse form by omitting small values:

$$
AX =
\begin{pmatrix}
6 & 12 & & & & \\
1 & 7 & 13 & & & \\
& 2 & 8 & 14 & & \\
& & 3 & 9 & 15 & \\
& & & 4 & 10 & 16 \\
& & & & 5 & 11
\end{pmatrix}
X = D, \quad (7)
$$

where $D \in \mathbb{R}^{6 \times 6}$ is a sparse right-hand-side matrix with 13 nonzero entries and $X$ is the sparse solution matrix. The calculated sparse matrix $X := \mathrm{prune}(\tilde{X}, S_p)$ is the dense result $\tilde{X} := A^{-1}D$ pruned to the sparsity pattern $S_p$ (all coefficients outside $S_p$ are discarded). This operation is possible even for large $A$ and $D$ with many columns because the stored solution $X$ is sparse. The sparsity pattern $S_p$ is chosen such that the largest absolute values in each column of the dense result $\tilde{X}$ are included in the sparse result $X$.

The rest of the paper is organized as follows: Section 2 discusses related work and Section 3 gives an overview of the contributions of this paper. Parallel algorithms are presented in Section 4 and their implementation details in Section 5. We conclude the paper with the numerical and performance results in Section 6. Appendix A explains the API of tridigpu from a user perspective.

2 RELATED WORK

On sequential architectures the Thomas Algorithm [37] is the classical choice to solve tridiagonal systems. Due to the increasing amount of parallel computer architectures, parallel algorithms like the Cyclic Reduction (CR) [23], the Parallel Cyclic Reduction (PCR) [13], the Partition Method [39], and the Divide and Conquer approach [40] were later developed. The Cyclic Reduction and the Parallel Cyclic Reduction were implemented by many authors [2, 7, 20, 25, 27] for a GPU.

Symbol | Meaning
A | block tridiagonal matrix or banded matrix of size N × N
N | number of unknowns in the linear equation system
n | size of one coefficient block, n = 1, 2, 3, 4
N̂ | number of block rows in the linear equation system (N = N̂ n)
a | lower band of coefficient blocks in the block tridiagonal matrix
b | diagonal of coefficient blocks in the block tridiagonal matrix
c | upper band of coefficient blocks in the block tridiagonal matrix
k_l | number of lower bands in a banded matrix, k_l = 0, 1, 2, 3, 4
k_u | number of upper bands in a banded matrix, k_u = 0, 1, 2, 3, 4
n_r | number of right-hand sides of the linear equation system
D | dense right-hand sides; matrix of size N × n_r
D | sparse right-hand sides encoded as a CSC matrix of size N × n_r
X | dense solution vectors; matrix of size N × n_r
X | sparse solution vectors encoded as a CSC matrix of size N × n_r
S_p | sparsity pattern matrix of size N × n_r
M | number of block rows per partition of the tridiagonal system
L | number of partitions of size M, which are processed by one CUDA block
n̂_r | number of right-hand sides, which are kept simultaneously in on-chip memory
n'_r | number of right-hand sides per batch of function tridigpu<t>gtsv_csc2csc

Table 1. Symbols and their meanings. The upper part refers to the interface of tridigpu, whereas the lower part refers to internal library parameters.
With the diagonal pivoting from Erway et al. [14], the first numerically stable tridiagonal solver for the GPU was implemented by Chang et al. [4, 5] and improved in aspects of numeric stability by Venetis et al. [38], who pointed out the problematic behaviour of diagonal pivoting and singular submatrices and therefore proposed g-spike. Two tridiagonal solvers were proposed by Diéguez et al. [9–12, 35]: the Tree Partitioning Reduction (TPR) and a Wang and Mou implementation with their Butterfly Processing Library for GPUs (WM BPLG). The tridiagonal partitioning scheme deployed in this paper is also used by Giles et al. [18] and László et al. [31] to solve many independent small tridiagonal systems with a size of up to 1000 unknowns. The same partitioning scheme is also used by Klein and Strzodka [30] to solve large non-cyclic scalar tridiagonal systems with a single right-hand side, and by Kim et al. [28], who proposed PaScaL_TDMA, a tridiagonal solver without pivoting for distributed memory machines, which communicate with MPI, to solve very large tridiagonal systems.

Batched block tridiagonal solvers [31] or batched scalar pentadiagonal solvers [17, 19] (PCR-Penta and cuPentBatch) were implemented on GPUs, but only Kamra and Rao [24] solved large block tridiagonal systems; they developed a parallel algorithm (DPBWZA) for block tridiagonal Toeplitz-block-Toeplitz linear systems. Ghosh and Mishra [17] proposed a Parallel Cyclic Reduction algorithm (PCR-Penta) to solve batched pentadiagonal systems. The above mentioned papers only treat special cases (batched, Toeplitz, pentadiagonal), whereas our tridigpu library solves a much more general problem (Eq. 1) including large cyclic block tridiagonal systems with multiple right-hand sides and support for (scaled) partial pivoting. Moreover, tridigpu solves general block tridiagonal systems without requiring symmetry and provides a comprehensive C API for multiple value types, as well as solving and factorizing operations.

3 CONTRIBUTIONS

This paper is a thorough extension of the tridiagonal solver proposed by Klein and Strzodka [30] into a multi-functional library and makes the following new contributions:

• support of different types of matrix coefficients and vector elements,
• solution of cyclic tridiagonal systems (a_0 ≠ 0, c_{N̂−1} ≠ 0),
• multiple right-hand sides (n_r > 1),
• performance optimizations for small scalar systems (small N),
• block tridiagonal systems and their factorization (n > 1, n_r ≥ 1),
• banded systems with up to four sub- and super-bands on each side of the diagonal,
• calculation of A^{-1}D for sparse matrices D.

3.1 Limitations

Internally tridigpu uses the Recursive Partitioned Tridiagonal Schur complement algorithm (RPTS) [30] in order to expose sufficient parallelism and solve large systems hierarchically. RPTS is based on the static partition methods, the accuracy of which is limited by the conditioning of the coarse system, which is not controlled explicitly. Therefore, ill-conditioned coarse systems can negatively affect the quality of the calculated solution, although in practice this is a seldom occurrence. In case of scalar values, (scaled) partial pivoting works as expected by comparing the absolute values and implicitly permuting the rows.
In case of small coefficient blocks (a_i, b_i, c_i), the determinants are compared and the block rows are implicitly permuted, but no row permutations within a block row occur. This imposes assumptions on the determinants of certain blocks and thus omits some opportunities for better numerical stability in favor of a unified code basis.

3.2 C API Overview

tridigpu follows the BLAS naming convention tridigpu<t><format><operation>, where <t> can be S, D, C, Z, which represent the types float, double, cuComplex, and cuDoubleComplex, respectively. As a placeholder for the corresponding types we use <T>, and [c] denotes an optional character 'c' in a function name. <format> is either '[c]gt' for (cyclic) general tridiagonal systems, '[c]bgt' for (cyclic) block general tridiagonal systems, or 'gb' for general banded systems. <operation> is either 'sv' for a solving operation, 'f' for a factorizing operation, or 'fs' for a solving operation with a given factorization. tridigpu provides the following functions:

• tridigpu<t>[c]gtsv: This function solves Problem Class Cyclic and Problem Class Scalar, which are scalar (cyclic) general tridiagonal systems with multiple right-hand sides.
• tridigpu<t>[c]bgtsv: This function solves Problem Class Block, which are (cyclic) block general tridiagonal systems with multiple right-hand sides.
• tridigpu<t>bgtf, tridigpu<t>bgtfs: These functions solve Problem Class Block, where bgtf factorizes the matrix explicitly and bgtfs solves the system for a specific right-hand side and a given factorization.
• tridigpu<t>gbsv: This function solves Problem Class DIA, which are general banded systems with multiple right-hand sides.
• tridigpu<t>gtsv_csc2dense: This function solves Problem Class ScalarCSC2Dense, which are scalar tridiagonal systems with multiple sparse right-hand sides and dense solutions.
• tridigpu<t>gtsv_csc2csc: This function solves Problem Class ScalarCSC2CSC, which are scalar tridiagonal systems with many sparse right-hand sides and sparse solutions.

Fig. 1. Matrix graph representation and matrix patterns during different phases of RPTS assuming no row permutations took place. The partition size M is equal to 7 and the system size is N = 21.

A detailed explanation of the C API is located in Appendix A. The API design follows the conventions from cuSPARSE [34] and LAPACK [3], e.g., only the factorizing function bgtf overwrites the input tridiagonal system. Except for tridigpu<t>gtsv_csc2csc, all functions do not allocate any extra storage, enqueue the calculation into a CUDA stream, and are non-blocking, i.e., they may return control back to the host before the result is ready. A minimal calling sketch is given below.
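As a usage illustration of this naming scheme, the following sketch calls the scalar solver from host code. The function name tridigpuSgtsv and the stream-based, non-blocking behaviour follow this section; the exact parameter list and return type are assumptions made for this sketch only, the authoritative declaration is the one in Appendix A and the library header.

```cpp
// Hypothetical usage sketch of the scalar tridiagonal solver tridigpuSgtsv.
// The prototype below is an assumption of this sketch; only the name and the
// stream-based, non-blocking behaviour are taken from Section 3.2.
#include <cuda_runtime.h>

extern "C" int tridigpuSgtsv(cudaStream_t stream,
                             int N,            // number of unknowns
                             int nr,           // number of right-hand sides
                             const float *dl,  // lower diagonal a_i (device)
                             const float *d,   // main  diagonal b_i (device)
                             const float *du,  // upper diagonal c_i (device)
                             float *B,         // right-hand sides, overwritten
                                               // with the solution (device)
                             int ldb);         // leading dimension of B

int solve_example(int N, int nr, const float *dl, const float *d,
                  const float *du, float *B) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // non-blocking call: the kernels are enqueued into 'stream' and control
    // may return to the host before the result is ready (Section 3.2)
    int status = tridigpuSgtsv(stream, N, nr, dl, d, du, B, N);
    cudaStreamSynchronize(stream);   // wait for the solution
    cudaStreamDestroy(stream);
    return status;
}
```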
4 ALGORITHM

The library presented in this paper is based on the Recursive Partitioned Tridiagonal Schur complement algorithm (RPTS) with (scaled) partial pivoting. RPTS was presented by Klein and Strzodka [30] to solve non-cyclic scalar tridiagonal systems with a single right-hand side, i.e. one special case (n_r = 1) of Problem Class Scalar. All presented problem classes in this paper can only be solved with the newly developed algorithms and implementations. Nevertheless, the mathematical formulation from [30] is the basis to solve the problem classes presented here. Therefore, in the following Section 4.1 and Figure 1, we revise the first part of Section 3 from [30] with regard to cyclic dependencies and block tridiagonal systems, while the following Sections 4.2–4.4 describe how RPTS has been extended to solve the much more general problem from Equation 1.

4.1 The Recursive Partitioned Tridiagonal Schur Complement Algorithm

4.1.1 Algorithmic Description. A (cyclic) tridiagonal adjacency matrix of a graph represents a long chain of connected vertices, which is shown on the left side in Figure 1. This chain is partitioned into regular partitions of size M. Here, a vertex can be a scalar or a block matrix of size n, which results in Mn equations for each partition. For all partitions in parallel, the downwards oriented elimination of the lower block band and the upwards oriented elimination of the upper block band is realised by applying sequential elementary row operations with pivoting, as shown in Figure 1 after step I. The downwards and upwards oriented eliminations both take the original matrix as input, and therefore represent 2N̂/M independent work tasks. During this elimination, in each partition, vertical fill-ins occur on the left and right side, which also contribute to the last and first block row of the partition. These rows form an independent set of 2N̂/M block equations (the coarse system) and are stored as a separate equation system, which is again tridiagonal, as shown by step II in Figure 1. Steps I and II represent the reduction step of RPTS. The coarse system is much smaller than the fine system, e.g., with M = 37 it has a size of only 5% of the fine system. As long as the coarse system is too large, the reduction step is applied recursively until it is small enough to be solved directly (step III). Then, the solution of the coarse system is substituted in the fine system (step IV) and the downwards oriented elimination is executed again, but due to the substitution no fill-in occurs this time (step V). After the latter step, the solution of the j = M − 2 block column is already available in each partition and is then substituted in the above block equation to calculate the solution of the j = M − 3 column. This upwards oriented substitution continues in the same fashion until all solutions of the M − 2 inner block rows are computed (step VI). Steps IV, V and VI represent the substitution step of RPTS, which only exposes N̂/M independent work tasks because the upwards oriented substitution is executed after the downwards oriented elimination. The redundant execution of the downwards oriented elimination in the reduction and substitution step of RPTS incurs additional calculations, but for scalar tridiagonal systems RPTS is bandwidth limited, and thus reducing data movement at the cost of more arithmetic operations is beneficial. With the solving of the inner partitions we refer to the downwards and upwards oriented elimination in the reduction step, and to the downwards oriented elimination and upwards oriented substitution in the substitution step.
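To make the reduction step concrete, the following minimal host-side sketch (not library code) performs the downwards and upwards oriented elimination for one scalar partition without pivoting and returns the two coarse equations the partition contributes to the coarse system. All variable names are our own and the partition size is assumed to satisfy M ≥ 3.

```cpp
// Sketch of the RPTS reduction step for ONE partition of a scalar tridiagonal
// system, without pivoting. Row 0 and row M-1 of the partition are the
// interface rows; rows 1..M-2 are inner rows. a[0] couples to the last unknown
// of the previous partition, c[M-1] to the first unknown of the next partition.
#include <vector>

struct CoarseRow { double lower, diag, upper, rhs; };

// Returns the two coarse equations (first and last interface row) that the
// partition contributes to the coarse tridiagonal system (step II in Fig. 1).
static void reduce_partition(const std::vector<double>& a,
                             const std::vector<double>& b,
                             const std::vector<double>& c,
                             const std::vector<double>& d,
                             CoarseRow& top, CoarseRow& bottom) {
    const int M = static_cast<int>(b.size());   // assumed M >= 3
    // Downwards oriented elimination of the lower band: pivot on the inner
    // rows 1..M-2, track the vertical fill-in in the column of unknown x_0.
    double lf = a[1], bb = b[1], cc = c[1], dd = d[1];
    for (int i = 2; i < M; ++i) {
        const double f   = a[i] / bb;            // elimination factor
        const double nlf = -f * lf;              // fill-in in the x_0 column
        const double nbb = b[i] - f * cc;
        const double ndd = d[i] - f * dd;
        lf = nlf; bb = nbb; cc = c[i]; dd = ndd;
    }
    // last interface row: lf*x_0 + bb*x_{M-1} + c[M-1]*x_next = dd
    bottom = { lf, bb, c[M - 1], dd };

    // Upwards oriented elimination of the upper band (an independent task that
    // also starts from the original data): fill-in in the x_{M-1} column.
    double rf = c[M - 2]; bb = b[M - 2]; double aa = a[M - 2]; dd = d[M - 2];
    for (int i = M - 3; i >= 0; --i) {
        const double f   = c[i] / bb;
        const double nrf = -f * rf;              // fill-in in the x_{M-1} column
        const double nbb = b[i] - f * aa;
        const double ndd = d[i] - f * dd;
        rf = nrf; bb = nbb; aa = a[i]; dd = ndd;
    }
    // first interface row: a[0]*x_prev + bb*x_0 + rf*x_{M-1} = dd
    top = { a[0], bb, rf, dd };
}
```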
4.1.2 Mathematical Description. With the partitioning of the graph in the left part of Figure 1 into partitions of size M, each partition has M − 2 inner vertices and 2 interface vertices. Let P be the index set of all inner vertices in the partition and I be the index set of all interface vertices. With respect to these index sets, matrix A can be reordered by a permutation Q into the (M − 2) × (M − 2)-block diagonal A_PP, the 2 × 2-block diagonal A_II, the wide A_IP, and the tall A_PI. The Schur-complement factorization then reads:

$$
QAQ^T = \begin{pmatrix} A_{PP} & A_{PI} \\ A_{IP} & A_{II} \end{pmatrix}
= \begin{pmatrix} I & 0 \\ A_{IP} A_{PP}^{-1} & I \end{pmatrix}
\begin{pmatrix} A_{PP} & A_{PI} \\ 0 & S \end{pmatrix}, \quad (8)
$$

$$
S := A_{II} - A_{IP} A_{PP}^{-1} A_{PI}, \quad (9)
$$

with the Schur complement S. In Equation 8, $QAQ^T$ is a permuted form of A from Equation 1. The above factorization allows the recursive solution of

$$
QAQ^T x = \begin{pmatrix} A_{PP} & A_{PI} \\ A_{IP} & A_{II} \end{pmatrix} \begin{pmatrix} x_P \\ x_I \end{pmatrix}
= \begin{pmatrix} d_P \\ d_I \end{pmatrix} = d \quad (10)
$$

in three steps:
Reduce: calculation of the coarse system S and the coarse right-hand side $d'_I := d_I - A_{IP} A_{PP}^{-1} d_P$.
Solve coarse system: the coarse system $S x_I = d'_I$ is solved by factorizing S recursively.
Substitute: find $d'_P := d_P - A_{PI} x_I$, i.e. substitute $x_I$, and solve $x_P := A_{PP}^{-1} d'_P$ for the inner vertices.

4.2 Cyclic Systems

In case of cyclic tridiagonal systems ($a_0, c_{\hat{N}-1} \neq 0$), the coarser system S of Equation 9 is also cyclic. During the RPTS reduction step (Figure 1, steps I & II), the values $a_0$ and $c_{\hat{N}-1}$, or due to pivoting their scaled values, are written to the upper right and lower left corner of the coarse equation system. In the higher stages, either the direct solver (step III) must be able to solve cyclic tridiagonal systems, or the system is reduced to a size of N = 2. In the latter case, the final RPTS reduction to a 2x2 system incorporates the values $a_0$ and $c_{\hat{N}-1}$ during the last row eliminations because the fill-in and the $a_0$, $c_{\hat{N}-1}$ are located in the same column.

$$
\begin{pmatrix}
b_0 & c_0 & & & & a_0 \\
a_1 & b_1 & c_1 & & & \\
l_2 & & b_2 & c_2 & & \\
l_3 & & & b_3 & c_3 & \\
l_4 & & & & b_4 & c_4 \\
c_5 & & & & a_5 & b_5
\end{pmatrix} \text{downwards}, \qquad
\begin{pmatrix}
b_0 & c_0 & & & & a_0 \\
a_1 & b_1 & & & & r_1 \\
& a_2 & b_2 & & & r_2 \\
& & a_3 & b_3 & & r_3 \\
& & & a_4 & b_4 & c_4 \\
c_5 & & & & a_5 & b_5
\end{pmatrix} \text{upwards} \quad (11)
$$

In a system where no row permutations take place and the last variables are about to be eliminated, which is $a_5$ in the downwards and $c_0$ in the upwards oriented elimination, as shown in Equation 11, the left fill-in $l_4$ refers to the same unknown as $c_5$.

4.3 Block Tridiagonal Systems

To solve block tridiagonal systems, the RPTS reduction and substitution algorithms in [30] are adjusted to support the non-commutativity of matrix-matrix and matrix-vector operations, and the divisions are replaced by multiplications with an inverse matrix.

Algorithm 1: Scaled partial pivoting of the RPTS substitution step for block types of size n, which also applies to the scalar case (n = 1).

    Function get_max_abs(X_0, X_1, ..., X_{l−1}):
        P ∈ R^{n×n}, P = 0
        for i = 0, ..., l−1 do
            for j = 0, ..., n−1 do
                for k = 0, ..., n−1 do
                    P_{jj} = max(|(X_i)_{jk}|, P_{jj})
                end
            end
        end
        return P
    end
    b_j, c_j ∈ R^{n×n}                          // previous equation
    a_{j+1}, b_{j+1}, c_{j+1} ∈ R^{n×n}         // current equation
    m_j = get_max_abs(b_j, c_j)
    m_{j+1} = get_max_abs(a_{j+1}, b_{j+1}, c_{j+1})
    if |det(m_j · a_{j+1})| ≤ |det(m_{j+1} · b_j)| then
        /* normal case: invert b_j */
    else
        /* pivot case: invert a_{j+1} */
    end

We consider the previous equation $b_j, c_j$ and the current equation $a_{j+1}, b_{j+1}, c_{j+1}$ during the downwards oriented elimination, when either $b_j$ or $a_{j+1}$ is about to be eliminated:

$$
\begin{pmatrix}
\ddots & \ddots & & \\
0 & b_j & c_j & \\
& a_{j+1} & b_{j+1} & c_{j+1} \\
& & \ddots & \ddots
\end{pmatrix} \quad (12)
$$
For that purpose, the previous equation can be multiplied with $(-1)\, a_{j+1} b_j^{-1}$ and be added to the current equation, or the current equation can be multiplied with $(-1)\, b_j a_{j+1}^{-1}$ and be added to the previous equation. So implicitly we require that at least $b_j$ or $a_{j+1}$ is invertible.

Whereas the scalar pivoting decision is made based on the absolute value of a scalar, the pivoting for block types is made based on the absolute value of the determinant of the block, as shown in Algorithm 1. To make the pivoting decisions invariant to the scaling of the linear system by a diagonal matrix, the scaled partial pivoting decision is made with respect to the maximum absolute value within each row of the equation system (diagonal matrices $m_j$ and $m_{j+1}$). Setting $m_j$ and $m_{j+1}$ to the identity matrix degrades the scaled partial pivoting to just partial pivoting, and setting them to zero disables pivoting.

4.4 Separate Factorization and Solve

An explicit factorization of the tridiagonal matrix is beneficial if the computation cannot be hidden behind data movement any more, which is the case for block tridiagonal systems and multiple right-hand sides. After the downwards oriented elimination (step I, Figure 1), the calculated LU-factorization of the inner partitions $(L^{(d)}_{PP} + I)\, U^{(d)}_{PP} := A_{PP}$ is saved in memory, where $L^{(d)}_{PP}$ is strictly lower triangular and $U^{(d)}_{PP}$ upper triangular. As a consequence of saving the modified coefficients, the upwards and downwards oriented elimination are not executed in parallel, but in two subsequent steps: upwards, and then downwards. The latter only affects the order of computations and the coarse system S of the factorization is the same as that of Equation 9. During the upwards oriented elimination, which computes the UL-factorization $(U^{(u)}_{PP} + I)\, L^{(u)}_{PP} := A_{PP}$, with strictly upper triangular matrix $U^{(u)}_{PP}$ and lower triangular matrix $L^{(u)}_{PP}$, the coefficients of $L^{(u)}_{PP}$ are discarded, but $U^{(u)}_{PP}$ is kept, which is required to calculate the coarse right-hand side $d'_I$ for a right-hand side $d$. Similar to the solve of scalar tridiagonal systems in Section 4.1, the solution of the inner vertices $x_P$ with a given factorization is obtained by $x_P := ((L^{(d)}_{PP} + I)\, U^{(d)}_{PP})^{-1} d'_P$.
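As a minimal illustration of the solve with a given factorization, the following sketch applies $x_P := ((L^{(d)} + I)\, U^{(d)})^{-1} d'_P$ to one inner partition in the scalar case. The storage of the factorization as three separate arrays is an assumption of this sketch, not the library's layout; names are our own.

```cpp
// Sketch (not library code) of the solve with a given factorization for ONE
// inner partition in the scalar case. The downwards oriented elimination is
// assumed to have stored the multipliers l[i] of the strictly lower L and the
// diagonal u_d[i] / upper band u_c[i] of U.
#include <vector>

static void solve_partition(const std::vector<double>& l,    // multipliers of L
                            const std::vector<double>& u_d,  // diagonal of U
                            const std::vector<double>& u_c,  // upper band of U
                            std::vector<double>& d)          // d'_P in, x_P out
{
    const int m = static_cast<int>(d.size());
    // forward substitution with (L + I): overwrite d with (L + I)^{-1} d
    for (int i = 1; i < m; ++i)
        d[i] -= l[i] * d[i - 1];
    // backward substitution with U: overwrite d with U^{-1} d
    d[m - 1] /= u_d[m - 1];
    for (int i = m - 2; i >= 0; --i)
        d[i] = (d[i] - u_c[i] * d[i + 1]) / u_d[i];
}
```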
5 IMPLEMENTATION

The core implementation of the tridigpu<t><format>sv functions consists of two CUDA kernels: the reduction kernel and the substitution kernel, which compute the reduction and substitution step of RPTS (see Section 4.1) for arbitrary types of the matrix coefficients and vector elements. In fact, the templatized source code for solving the inner partitions is the same for all problem classes, whereas the source code which loads the data to on-chip memory and writes the result to global memory differs between the problem classes. The beginning of this section summarizes general implementation features, similar to [30], but extended to the general problem classes of this paper. More elaborate topics are discussed in the following subsections 5.1–5.6.

Active warps: The number of processed partitions per CUDA block is usually L = 32 because we want a large M (shallow recursion) and the product LM is limited by the amount of available shared memory per block. For L = 32, there are two CUDA warps in the reduction kernel and one CUDA warp in the substitution kernel solving the inner partitions actively. The other threads in the CUDA block only participate during data load and store operations.

kernel       | read elements                                          | written elements                         | read & written
reduction    | 3N + n_r N   (A + D)                                   | (3 + n_r) 2N/M   (coarse A + coarse D)   | 3N + n_r N + (3 + n_r) 2N/M
substitution | 3N n_r/n̂_r + n_r N + n_r 2N/M   (A + D + coarse X)     | n_r N   (fine X)                         | 3N n_r/n̂_r + 2 n_r N + n_r 2N/M
total        | 3N (n_r/n̂_r + 1) + 2 n_r N + n_r 2N/M   (fine system + coarse system) | (3 + n_r) 2N/M + n_r N  | 3N (n_r/n̂_r + 1 + n_r) + (2 n_r + 3) 2N/M

Table 2. Data reads and writes from global GPU memory for one stage and different kernels of tridigpu<t>gtsv in case of a scalar tridiagonal system. See Table 1 for the meaning of the variables. The bracketed expressions indicate to which variable the reads or writes correspond.

Memory layout: The block tridiagonal matrix is saved in a banded format, i.e. three buffers (a_i, b_i, c_i) of length N̂. a_0 and c_{N̂−1} couple the first and last equation and are non-zero for cyclic problems. The additional memory consumption of RPTS, which is required by the higher stages in the hierarchy, is very little, and for n = 1, N̂ = 2^25 or N̂ = 2^20 and M = 41 amounts to less than 6% of the input data.

Shared memory layout: The block tridiagonal partitions and the corresponding part of the right-hand side are loaded coalesced into shared memory, where they are subsequently processed sequentially by an active thread, which solves the inner of a partition. Note that the shared memory consumption increases quadratically with n, e.g., going from block size n = 2 to n = 4 quadruples the consumption.

Storage format of row interchanges during pivoting: The pivoting decisions can be saved in a binary tree because for each column j we only encode if the coefficient in row j or row j + 1 is eliminated (see Figure 2). By saving this bit pattern in an int64_t, the row permutation can be reconstructed out of that bit pattern.

Fig. 2. Pivoting decisions represented as a binary tree and encoded as the bit pattern 0b000...1101. Either the current equation j or the next equation j + 1 is used to eliminate the coefficient in column j.

Complete avoidance of SIMD divergence: Dynamic pivoting decisions tend to be implemented with if-else statements and branching. Nevertheless, the code for solving the inner partitions in the reduction and substitution kernel of RPTS is completely free of branching by replacing if-else statements with ternary operators (x = p ? a : b). The elimination of the coefficient in column j is then formulated as a linear combination of two equations where the scaling factors of the equations depend on the pivoting decision.
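The following sketch illustrates the last two points for the scalar case: a branch-free pivoting decision expressed with ternary operators and the per-column decision bits packed into an int64_t (cf. Figure 2). It is an illustration with our own names, not the library's kernel code.

```cpp
// Branch-free pairwise elimination with plain partial pivoting (scalar case).
#include <cstdint>
#include <cmath>

// one equation, viewed on the columns j, j+1, j+2 of the current elimination:
// the previous equation is (b_j, c_j, 0), the current one (a_{j+1}, b_{j+1}, c_{j+1})
struct Eq { double cj, cj1, cj2, rhs; };

// Eliminate column j from the pair (equation j, equation j+1) without any
// if-else branching. 'pivots' collects one decision bit per column.
inline Eq eliminate_column(Eq& prev, const Eq& cur, int j, int64_t& pivots)
{
    const bool p = std::fabs(cur.cj) > std::fabs(prev.cj); // pivot if |a_{j+1}| > |b_j|
    pivots |= (static_cast<int64_t>(p) << j);              // record the decision bit

    const Eq piv = p ? cur : prev;   // row that keeps its entry in column j
    const Eq oth = p ? prev : cur;   // row whose entry in column j is eliminated
    const double f = oth.cj / piv.cj;

    // linear combination of the two equations; only the factor depends on the
    // pivoting decision, so all SIMD lanes execute the same instructions
    Eq next;                          // continues as equation j+1
    next.cj  = 0.0;
    next.cj1 = oth.cj1 - f * piv.cj1;
    next.cj2 = oth.cj2 - f * piv.cj2;
    next.rhs = oth.rhs - f * piv.rhs;

    prev = piv;                       // (implicitly permuted) output equation j
    return next;
}
```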
5.1 Multiple Right-Hand Sides

The reduction kernel of tridigpu coarsens the tridiagonal system until it is small enough to be solved directly. Subsequently, the substitution kernel redundantly solves the inner partitions, but without any fill-in, to obtain the fine solution vector. For the solution of one scalar tridiagonal system the solver reads and writes in total 9N elements on the first stage. This can be seen in Table 2, where n̂_r is the number of right-hand sides which are kept simultaneously in on-chip memory, such that the amount of inner partition solves is reduced. Setting n_r = 1, n̂_r = 1, and neglecting the terms of the coarse system with factor 2N/M, 9N elements are read and written in total. With a coarse system size of 2N/M, the amount of data which is read and written in higher stages is of order 9 · 2N/M = 18N/M, which is one order of magnitude smaller than the fine system (9N) because typical values of M for a scalar tridiagonal system are in the range [31, 41]. For multiple right-hand sides, the tridiagonal system is only loaded once into on-chip memory in the reduction kernel because the tridiagonal system is not changed in memory during the inner partition solve. The same optimization cannot be applied to the substitution kernel because the tridiagonal system is modified during the inner partition solve, but there the tridiagonal system is only loaded once for n̂_r right-hand sides. In comparison to a naive repeated solver execution, which needs to process E = n_r 9N elements, solving n_r right-hand sides with one single solver execution only processes E' = 3N (n_r/n̂_r + 1 + n_r) elements, which in relation to E is only

$$
E'/E = \frac{3N\,(n_r/\hat{n}_r + 1 + n_r)}{n_r\, 9N} = \frac{1/\hat{n}_r + 1/n_r + 1}{3}
\;\overset{\hat{n}_r = 4}{=}\; \frac{1/4 + 1/n_r + 1}{3}
\;\overset{n_r \gg 1}{\approx}\; \frac{5/4}{3} \approx 42\%.
$$

5.2 Optimizations for Small Tridiagonal Systems

For a static partition size M, we observe that solving small tridiagonal systems does not utilize all available GPU resources, which is expected because M determines the work per CUDA thread and is chosen as a trade-off between available shared memory and number of stages. For smaller M, less shared memory is used per CUDA block, therefore more CUDA blocks can be resident per streaming multiprocessor, but the size of the coarser tridiagonal system N' increases because of N' = 2N/M. For larger M, the shared memory consumption is high, and fewer CUDA blocks are resident per streaming multiprocessor, which is inefficient in the limit of only one active CUDA block because for each CUDA block only one warp is solving the inner partitions in the substitution kernel. To achieve maximum GPU performance it is essential to make use of latency hiding techniques by changing the context to another warp if the current warp is idling. The latter is impossible if only one CUDA block is resident per streaming multiprocessor, thus M must not be chosen too large. Therefore, our aim is to decrease the number of stages and to increase the GPU utilization. To address the former point, we developed a fused final stage kernel (FFSK), which is able to solve tridiagonal systems of size warpSize · M on chip, and to address the latter point we use a dynamic choice of M, which depends on the GPU occupancy.

5.2.1 Fused Final Stage Kernel. The FFSK is a kernel fusion of the reduction kernel, single thread final stage kernel, and substitution kernel of RPTS. The single thread final stage kernel of RPTS [30] only uses one single CUDA thread to solve tridiagonal systems of maximum size Ñ = 32. Ñ is the maximum tridiagonal system size of the final stage tridiagonal solver. The FFSK loads the tridiagonal system once into shared memory, calculates the RPTS reduction step, saves the coarse tridiagonal system in shared memory, solves the coarse system, calculates the RPTS substitution step to obtain the fine solution, and stores the fine solution back to global GPU memory. It is possible to use other well known parallel algorithms, like the (Parallel) Cyclic Reduction, as a final stage solver, but in our experiments we observed that these algorithms are less numerically stable. The FFSK is implemented with (scaled) partial pivoting for improved numeric stability.
Algorithm 2: Dynamic choice of the partition size M_i for each stage i of RPTS.

    input : N, sorted {M̃_j}, n̂_r
    output: {N_i}, {M_i}
     1  j_max = arg max_j M̃_j
     2  Ñ = warpSize · M̃_{j_max}
     3  N_0 = N
     4  i = 0
     5  M_0 = M̃_{j_max}
     6  while N_i > Ñ do
     7      j = j_max − 1
     8      while grid_dim(N_i, M_i) < max_concurrent_blocks(N_i, M_i, n_r) ∧ M_i ≠ M̃_0 do
     9          M_i = M̃_j
    10          j = j − 1
    11      end
    12      i = i + 1
    13      N_i = 2 · ceil(N_{i−1} / M_{i−1}) − (N_{i−1} % M_{i−1} == 1 ? 1 : 0)
    14      M_i = M̃_{j_max}   // set M for next loop iteration
    15  end
    16  M_i = get_smallest_final_stage_M(N_i, {M̃_j})

5.2.2 Dynamic choice of M. Algorithm 2 shows the dynamic choice of the partition size for each stage i of RPTS, where we denote an array of elements with curly braces { · }. The {M̃_j} are the sorted partition sizes, whereas M_i is the partition size and N_i the tridiagonal system size for stage i. In Line 1 the index j_max of the largest admissible partition size is obtained. In Line 6, the while-loop processes every stage configuration until the system is small enough to be solved directly. In Line 8, a maximum value for M is chosen, such that the CUDA grid is at least as large as the maximum possible number of concurrently running blocks. Subsequently, in Line 13, the size of the coarser tridiagonal system is calculated, and the algorithm ends by calculating the smallest possible partition size for the final stage solver in Line 16.

5.3 Block tridiagonal systems

The kernels of tridigpu are implemented for general numerical types which support +, -, *, / and abs. Commutativity for operator '*' is not required as this is not fulfilled for block types either. Operator '/' for block types is equal to the multiplication with the inverted block, and abs calculates the absolute value of the determinant of the block, which is required to make the pivoting decisions. In the kernel implementation, the pivoting scheme (see Section A.4) is passed in as a template parameter, which avoids any unnecessary operations at runtime. With generic programming and templates, no separate code paths are required to support block tridiagonal systems. Instead of instantiating the kernels with scalar value types, they are instantiated with a block matrix of fixed size when solving block tridiagonal systems. A minimal sketch of such a block value type is given below.
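The following is a minimal sketch of such a block value type for n = 2, with the operator semantics described above (operator/ multiplies with the inverted block, abs returns the absolute value of the determinant). It is an illustration only and not the library's block type; names are our own.

```cpp
// Minimal 2x2 block value type with the operator semantics of Section 5.3.
#include <cmath>

struct Block2 {
    double m[2][2];

    friend Block2 operator+(const Block2& a, const Block2& b) {
        Block2 r;
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j) r.m[i][j] = a.m[i][j] + b.m[i][j];
        return r;
    }
    friend Block2 operator-(const Block2& a, const Block2& b) {
        Block2 r;
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j) r.m[i][j] = a.m[i][j] - b.m[i][j];
        return r;
    }
    friend Block2 operator*(const Block2& a, const Block2& b) {  // non-commutative
        Block2 r;
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j)
                r.m[i][j] = a.m[i][0] * b.m[0][j] + a.m[i][1] * b.m[1][j];
        return r;
    }
    Block2 inverse() const {                       // explicit 2x2 inverse
        const double det = m[0][0] * m[1][1] - m[0][1] * m[1][0];
        Block2 r;
        r.m[0][0] =  m[1][1] / det;  r.m[0][1] = -m[0][1] / det;
        r.m[1][0] = -m[1][0] / det;  r.m[1][1] =  m[0][0] / det;
        return r;
    }
    friend Block2 operator/(const Block2& a, const Block2& b) {
        return a * b.inverse();                    // division = multiply by inverse
    }
    friend double abs(const Block2& a) {           // |det|, used for pivoting
        return std::fabs(a.m[0][0] * a.m[1][1] - a.m[0][1] * a.m[1][0]);
    }
};
```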
5.4 Factorization of a Block Tridiagonal Matrix

The computational effort of those kernels which always calculate the factorization anew (functions tridigpu<t><format>sv) grows as O(n^3). Therefore, a large block size n prevents the kernels from running at full GPU memory bandwidth. In that case, it is efficient to precalculate the tridiagonal matrix factorization, such that no block inversions, determinant calculations and matrix-matrix operations are calculated redundantly. The overall data movement is shown in Table 3.

kernel              | bytes read                                                             | bytes written
factorization, bgtf | 3N̂ n² s_t   (fine A)                                                   | (3N̂ n² + 2N̂ n² + 3 n² 2N̂/M) s_t + 2N̂/M s_i   (factorized A + r_u and r_d + coarse A + pivots)
reduction, bgtfs    | (2N̂ n² + N̂ n n_r) s_t + 2N̂/M s_i   (factorized A + fine D + pivots)   | 2N̂/M n n_r s_t   (coarse D)
substitution, bgtfs | (N̂ n² + 3N̂ n² + N̂ n n_r + 2N̂/M n n_r) s_t + N̂/M s_i   (r_d + factorized A + fine D + coarse X + pivots) | N̂ n n_r s_t   (fine X)

Table 3. Data reads and writes from global GPU memory for one stage and different kernels of tridigpu<t>bgtf and tridigpu<t>bgtfs. s_t is sizeof(T) and s_i is sizeof(int64_t). See Table 1 for the meaning of the variables. The bracketed expressions indicate to which variable the reads or writes correspond.

The factorization kernel calculates the coarse system (2N̂/M blocks), the pivot decisions (N̂/M elements of type int64_t), the inverted blocks r_u, r_d (2N̂ blocks), and the modified tridiagonal system after the downwards oriented elimination (3N̂ blocks). With r_u and r_d we denote the row multiplication factors of the upwards and downwards oriented elimination, respectively. In the lower band of the factorized matrix, we save the inverted blocks a_i^{-1} as these values must be inverted in the upwards oriented substitution step of RPTS (Algorithm 2 in [30], Line 31) to obtain the solution values. Therefore, the substitution and reduction kernel, which solve a system with a given factorization, are completely free of block inversions. Given a precomputed factorization, the reduction kernel only reads 2N̂ inverted blocks (r_u, r_d), the pivot decisions, which are efficiently saved in bits, and the right-hand sides with N̂ n_r blocks; and only the coarse right-hand sides are written to global memory. The upwards and downwards oriented elimination is calculated in parallel and only consists of matrix vector multiplications and vector additions. If we neglect the reading of the pivot decisions, the amount of read data is reduced for the reduction kernel of bgtfs by N̂ elements in comparison to bgtsv. The substitution kernel of bgtfs reads one additional band of N̂ elements in comparison to bgtsv, which are the inverted blocks r_d of the downwards oriented elimination, and the pivot decisions. The downwards oriented elimination must be calculated for the right-hand sides with the already available inverted blocks and pivot decisions. This cannot be pre-calculated in the reduction kernel because the solution from the coarse system is not yet available.

Fig. 3. Banded format to block tridiagonal format conversion. Each color refers to one band. Here, N = 14, N̂ = 7, k_l = 2, k_u = 2, and consequently n = 2.

5.5 Banded Format to Block Tridiagonal Conversion

To solve banded systems with tridigpu<t>gbsv, the system is converted into block tridiagonal form while reading the bands from global to on-chip memory, as shown in Figure 3. Afterwards, the RPTS algorithm can be used to solve the on-chip block tridiagonal system. To save a banded matrix with k_l + k_u + 1 bands as a block tridiagonal matrix, the blocks must be of size max(k_l, k_u), and for each matrix row max(k_l, k_u) − 1 + abs(k_l − k_u) zeros are additionally saved. Therefore, the memory consumption of a block tridiagonal matrix representation is increased by a factor of (max(k_l, k_u) − 1 + abs(k_l − k_u)) / (k_l + k_u + 1) + 1 in comparison to the banded matrix representation. For full pentadiagonal systems (k_l = k_u = 2) this is an increase of 20%. Note that the block tridiagonal representation of the initial banded system is never saved in global GPU memory, but exclusively in on-chip memory. Only the intermediate results in the higher stages of the solver write and read block tridiagonal systems. However, this contributes little to additional data movement, because the coarse systems are small. Instead, the speed of the block tridiagonal solver is bounded by the limited thread parallelism in the computation.
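The following host-side sketch illustrates the conversion of Figure 3 for a banded matrix stored as k_l + k_u + 1 diagonals. The storage convention band[d][i] = A(i, i + d − k_l) and all names are assumptions of this sketch; the library performs this conversion on the fly while loading to on-chip memory.

```cpp
// Banded format to block tridiagonal format conversion (cf. Figure 3).
// Assumes N is a multiple of n = max(k_l, k_u).
#include <vector>
#include <algorithm>

using Band  = std::vector<double>;                 // one diagonal, length N
using Block = std::vector<double>;                 // n*n values, row-major

static double entry(const std::vector<Band>& band, int kl, int ku,
                    int N, int row, int col) {
    const int off = col - row;                     // diagonal offset of A(row,col)
    if (row < 0 || col < 0 || row >= N || col >= N) return 0.0;
    if (off < -kl || off > ku) return 0.0;         // outside the bands
    return band[off + kl][row];
}

// Produces the three block bands a_I, b_I, c_I of the block tridiagonal matrix.
static void banded_to_block_tridiag(const std::vector<Band>& band,
                                    int kl, int ku, int N,
                                    std::vector<Block>& a,
                                    std::vector<Block>& b,
                                    std::vector<Block>& c) {
    const int n    = std::max(kl, ku);
    const int Nhat = N / n;
    a.assign(Nhat, Block(n * n, 0.0));
    b.assign(Nhat, Block(n * n, 0.0));
    c.assign(Nhat, Block(n * n, 0.0));
    for (int I = 0; I < Nhat; ++I)
        for (int r = 0; r < n; ++r)
            for (int col = 0; col < n; ++col) {
                const int row = I * n + r;
                a[I][r * n + col] = entry(band, kl, ku, N, row, (I - 1) * n + col);
                b[I][r * n + col] = entry(band, kl, ku, N, row,  I      * n + col);
                c[I][r * n + col] = entry(band, kl, ku, N, row, (I + 1) * n + col);
            }
}
```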
5.6 Calculation of A^{-1}D with Sparse Right-Hand Sides and Solutions

Function gtsv_csc2csc utilizes gtsv_csc2dense and a k-selection algorithm [16], which selects the k largest elements with respect to their absolute value, but without sorting them. The right-hand sides are processed in batches of n'_r vectors. For each n'_r right-hand sides, gtsv_csc2dense is called once, and the k-selection algorithm is called n'_r times, which results in 4 n'_r histogram calculations in the worst case for single precision floating-point numbers. Our k-selection algorithm is similar to the BucketSelect of Alabi et al. [1], but has a fixed histogram size of 256 bins. We exploit that positive floating point numbers (IEEE 754-1989) retain their ordering when they are reinterpreted as unsigned integers based on their bit pattern. For single precision floating point numbers, a maximum of four histograms are calculated to find the kth largest number. In the first histogram, each of the 256 bins represents the occurrence of a specific value of the first byte of the four-byte pattern. Afterwards, one single CUDA block calculates the prefix sum and determines if an accumulated bin contains exactly k numbers. In that case, all numbers with the byte prefix represented by that bin are selected with CUB's select_if device primitive [33]. If there is no accumulated bin which contains k numbers, the histogram and the prefix sum are calculated for the next byte of the four-byte pattern until a bin contains the remaining number of required maximum elements, to obtain a byte prefix which is used to select the k maximum values with CUB's select_if with respect to their absolute value. For each right-hand side which is processed by gtsv_csc2csc, k is set to the number of non-zero values in the corresponding sparse result matrix column.
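The following CUDA sketch illustrates the two ingredients of this k-selection: the order-preserving reinterpretation of positive floats as unsigned integers and a 256-bin histogram over one byte of that pattern, restricted to the already fixed prefix. It is an illustration with our own names, not the library's implementation (which, for instance, aggregates the atomic additions in shared memory).

```cpp
// Ordered key mapping and byte-prefix histogram for a radix-style k-selection.
#include <cstdint>
#include <cstring>

// Positive floats keep their order when their bit pattern is read as uint32.
__host__ __device__ inline uint32_t ordered_key(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof(u));      // reinterpret the bit pattern
    return u & 0x7fffffffu;         // absolute value: clear the sign bit
}

// Histogram of byte number 'byte' (3 = most significant) of the keys,
// restricted to keys that share the already fixed more significant prefix.
__global__ void byte_histogram(const float* values, int count,
                               int byte, uint32_t prefix, uint32_t prefix_mask,
                               unsigned int* histogram /* 256 bins */)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;
    const uint32_t key = ordered_key(values[i]);
    if ((key & prefix_mask) != prefix) return;          // outside current bucket
    const uint32_t bin = (key >> (8 * byte)) & 0xffu;   // selected byte
    atomicAdd(&histogram[bin], 1u);                     // count occurrences
}
// After a prefix sum over the 256 bins, the bin containing the k-th largest
// element fixes the next byte of the prefix; at most four rounds are needed
// for single precision values.
```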
6 RESULTS

For the results presented in this paper, we use a machine with a GeForce RTX 2080 Ti, CUDA 11.2.142, CentOS 7, GCC 10.2.0, CUDA driver 450.57, and an Intel Xeon Silver 4110 @ 2.10 GHz. The scalar tridiagonal solver performance of RPTS was already evaluated in detail in [30]. For the performance results single precision is chosen because the GeForce RTX 2080 Ti has very few double-precision arithmetic units. Throughout this section, we use partial pivoting if the pivoting scheme is not mentioned explicitly.

6.1 Numerical Results

In this section we analyze the numerical stability of RPTS as a block tridiagonal and banded solver (n = 2), and compare the results with well established and numerically stable CPU solvers: LAPACK (gbsv) and the sparse LU decomposition implemented in Eigen3 [21], as well as with the QR factorization implemented in cuSPARSE for the solution of batched pentadiagonal systems (DgpsvInterleavedBatch). For RPTS we set the partition size to M = 16 block rows, such that there is only one reduction and one substitution step for the presented test cases.

ID | condition number | description
1  | 3.45e+02 | pentadiag(U(−1, 1), U(−1, 1), U(−1, 1), U(−1, 1), U(−1, 1))
2  | 1.00e+00 | pentadiag(U(−1, 1), U(−1, 1), 1e8*ones(N), U(−1, 1), U(−1, 1))
3  | 3.45e+02 | same as #1, but dll[N/2+1] = 1e-50*dll[N/2+1]
4  | 3.43e+02 | same as #3, but dl[N/2+1] = 1e-50*dl[N/2+1], dll[N/2+2] = 1e-50*dll[N/2+2]
5  | 3.43e+03 | same as #1, but each element of dll, duu has 50% chance to be zero
6  | 7.78e+03 | same as #1, but each element of dll, dl, du, and duu has 50% chance to be zero
7  | 1.06e+00 | pentadiag(U(−1, 1), U(−1, 1), 64*ones(N), U(−1, 1), U(−1, 1))
8  | 9.98e+14 | gallery('randsvd', N, 1e15, 2, 2, 2)
9  | 9.43e+14 | gallery('randsvd', N, 1e15, 3, 2, 2)
10 | 1.18e+15 | gallery('randsvd', N, 1e15, 1, 2, 2)
11 | 9.65e+14 | gallery('randsvd', N, 1e15, 4, 2, 2)
12 | 3.55e+07 | same as #1, but dll = dll*1e-50
13 | 6.76e+34 | same as #12, but dl = dl*1e-50
14 | 1.02e+01 | gallery('toeppen', N, 1, -10, 0, 10, 1)
15 | 1.45e+03 | pentadiag(U(−1, 1), U(−1, 1), 1e-8*ones(N), U(−1, 1), U(−1, 1))
16 | 1.45e+03 | pentadiag(U(−1, 1), U(−1, 1), zeros(N), U(−1, 1), U(−1, 1))
17 | 3.57e+04 | pentadiag(ones(N-2), ones(N-1), 1e-8*ones(N), ones(N-1), ones(N-2))
18 | 1.00e+00 | pentadiag(ones(N-2), ones(N-1), 1e8*ones(N), ones(N-1), ones(N-2))
19 | 2.56e+00 | pentadiag(-ones(N-2), -ones(N-1), 8*ones(N), -ones(N-1), -ones(N-2))
20 | 2.56e+00 | pentadiag(-ones(N-2), ones(N-1), 8*ones(N), ones(N-1), -ones(N-2))
21 | 1.51e+00 | pentadiag(-ones(N-2), ones(N-1), 8*ones(N), ones(N-1), ones(N-2))
22 | 1.54e+00 | pentadiag(-ones(N-2), -ones(N-1), 8*ones(N), ones(N-1), U(−1, 1))
23 | 1.85e+00 | pentadiag(-ones(N-2), -ones(N-1), 8*ones(N), U(−1, 1), U(−1, 1))

Table 4. Pentadiagonal matrix collection for numerical-stability analysis. MATLAB functions are used. Function pentadiag(dll, dl, d, du, duu) returns a pentadiagonal matrix with main diagonal d, lower diagonals dll, dl, and upper diagonals du, duu. U(−1, 1) is the uniform distribution between minus one and one. All matrices are of the same size N. The condition number was calculated with the JacobiSVD function of the Eigen3 library for matrices of size N = 512.

matrix   | problem         | DOFs   | nnz     | bands | LAPACK   | Eigen3   | RPTS partial pivot | RPTS no pivot
nos2     | structural      | 957    | 4 137   | 9     | 1.90e+00 | 9.85e-01 | 2.10e+00           | 1.03e+01
olm1000  | fluid dynamics  | 1 000  | 3 996   | 7     | 8.90e-15 | 8.12e-15 | 1.01e-13           | 9.92e-14
olm2000  | fluid dynamics  | 2 000  | 7 996   | 7     | 1.73e-09 | 1.73e-09 | 6.92e-09           | 2.55e-09
olm500   | fluid dynamics  | 500    | 1 996   | 7     | 6.27e-13 | 1.20e-12 | 4.86e-14           | 3.29e-14
olm5000  | fluid dynamics  | 5 000  | 19 996  | 7     | 2.43e-12 | 2.82e-12 | 7.13e-13           | 7.73e-13
tub1000  | fluid dynamics  | 1 000  | 3 996   | 5     | 1.45e-13 | 1.35e-13 | 3.89e-14           | 1.13e-14
linverse | statistical     | 11 999 | 95 977  | 9     | 1.32e-11 | 1.46e-11 | 2.27e-12           | 2.29e-12
spmsrtls | statistical     | 29 995 | 229 947 | 9     | 1.50e-14 | 2.67e-14 | 9.62e-11           | 9.62e-11
lf10000  | model reduction | 19 998 | 99 982  | 7     | 5.29e-12 | 3.33e-12 | 3.18e-13           | 8.52e-13

Table 5. Left: properties of matrices from the Sparse Matrix Collection [8]. Right: forward relative error of double-precision results for numerical stability analysis of banded solvers.

For the evaluation, the solution vector x_t is generated with a normal distribution of floating-point numbers with a mean value of 3 and standard deviation of 1. The forward relative error is then calculated by |x − x_t|_2 / |x_t|_2, where x is the calculated solution and x_t is the exact solution.
We create the 2x2 block tridiagonal test matrices by transforming the scalar tridiagonal test matrices from [30]: if $A_s \in \mathbb{R}^{N/2 \times N/2}$ is a test matrix from [30] Table 1, we obtain a 2x2 block tridiagonal test matrix $A_b \in \mathbb{R}^{N \times N}$ with the same condition as $A_s$ by

$$
(A)_{ij} :=
\begin{cases}
(A_s)_{\lfloor i/2 \rfloor,\, \lfloor i/2 \rfloor} & \text{if } i = j \\
(A_s)_{\lfloor i/2 \rfloor,\, \lfloor i/2 \rfloor - 1} & \text{if } i - 2 = j \\
(A_s)_{\lfloor i/2 \rfloor,\, \lfloor i/2 \rfloor + 1} & \text{if } i + 2 = j \\
0 & \text{otherwise}
\end{cases} \quad (13)
$$

$$
A_b := O_l\, A\, O_r, \quad (14)
$$

where $O_l$ and $O_r$ are orthogonal 2x2 block diagonal matrices. With a uniformly distributed floating-point number $\alpha \in [0, \pi]$, a single 2x2 orthogonal block in $O_l$ or $O_r$ is generated by

$$
\begin{pmatrix} \cos(\alpha) & -\sin(\alpha) \\ \sin(\alpha) & \cos(\alpha) \end{pmatrix}. \quad (15)
$$

A small construction sketch is given after Table 6. We create the pentadiagonal test matrices analogously to the tridiagonal numerical test cases in [5, 11, 30, 32, 38], which are listed in Table 4.

The double-precision results for 2x2 block tridiagonal matrices are shown in Table 7 and for pentadiagonal matrices in Table 6. The right part of each table shows the conditioning of the coarse tridiagonal system, which has a size of N' = 2N/(Mn). The condition is NA if the coarse system could not be calculated due to numerical issues in the reduction step.

ID | LAPACK   | Eigen3   | cuSPARSE nopivot | RPTS partial pivot | RPTS no pivot | κ coarse (partial pivot) | κ coarse (no pivot)
1  | 2.92e-15 | 2.11e-15 | 1.00e-14 | 6.49e-15 | 2.98e-14 | 3.15e+02 | 2.46e+02
2  | 1.27e-16 | 1.27e-16 | 2.57e-16 | 1.31e-16 | 1.31e-16 | 1.00e+00 | 1.00e+00
3  | 2.81e-15 | 2.05e-15 | 6.97e-15 | 5.97e-15 | 2.86e-14 | 3.15e+02 | 2.46e+02
4  | 2.92e-15 | 2.07e-15 | 8.03e-15 | 5.92e-15 | 2.91e-14 | 3.15e+02 | 2.46e+02
5  | 5.13e-15 | 1.28e-14 | 1.60e-14 | 5.39e-14 | 1.85e-13 | 6.02e+02 | 6.02e+02
6  | 9.19e-15 | 9.19e-15 | 4.97e-14 | 8.85e-14 | 5.74e-13 | 7.70e+02 | 6.45e+03
7  | 1.43e-16 | 1.42e-16 | 3.85e-16 | 1.55e-16 | 1.55e-16 | 1.05e+00 | 1.05e+00
8  | 6.97e-04 | 6.97e-04 | 4.01e-03 | 2.99e-04 | 1.32e-01 | 2.18e+12 | 2.76e+15
9  | 8.43e-05 | 1.65e-04 | 3.19e-03 | 3.77e-04 | 1.69e-04 | 4.57e+14 | 1.85e+14
10 | 4.58e-04 | 4.06e-04 | 3.09e-03 | 2.60e-04 | 2.57e-04 | 4.24e+01 | 1.84e+01
11 | 7.31e-04 | 1.53e-03 | 1.18e-03 | 3.16e-04 | 2.12e-01 | 2.06e+14 | 2.08e+14
12 | 8.90e-12 | 8.90e-12 | 5.57e-11 | 5.75e-10 | 1.21e-10 | 1.84e+07 | 1.84e+07
13 | 2.44e+71 | 2.90e+71 | 6.86e+70 | 6.18e+72 | 8.24e+70 | 3.51e+28 | 5.12e+40
14 | 3.75e-16 | 3.65e-16 | 7.98e-16 | 3.62e-16 | 3.62e-16 | 2.94e+00 | 2.94e+00
15 | 9.23e-15 | 3.31e-15 | 9.59e-15 | 1.56e-14 | 2.39e-13 | 6.14e+02 | 2.89e+03
16 | 4.40e-15 | 4.54e-15 | 2.45e-14 | 7.39e-15 | 1.62e-13 | 6.14e+02 | 2.89e+03
17 | 5.97e-14 | 9.20e-14 | 3.35e-13 | 7.93e-04 | 4.96e+00 | 6.56e+08 | 4.49e+10
18 | 1.64e-16 | 1.63e-16 | 5.02e-16 | 1.40e-16 | 1.40e-16 | 1.00e+00 | 1.00e+00
19 | 2.00e-16 | 1.99e-16 | 7.33e-16 | 1.51e-16 | 1.51e-16 | 1.86e+00 | 1.86e+00
20 | 1.64e-16 | 1.65e-16 | 4.71e-16 | 1.63e-16 | 1.63e-16 | 1.86e+00 | 1.86e+00
21 | 1.63e-16 | 1.65e-16 | 4.47e-16 | 1.59e-16 | 1.59e-16 | 1.32e+00 | 1.32e+00
22 | 1.55e-16 | 1.55e-16 | 4.09e-16 | 1.51e-16 | 1.51e-16 | 1.31e+00 | 1.31e+00
23 | 1.49e-16 | 1.49e-16 | 3.53e-16 | 1.40e-16 | 1.40e-16 | 1.51e+00 | 1.51e+00

Table 6. Left part: forward relative error of double-precision results for numerical stability analysis of pentadiagonal solvers. Right part: coarse system condition for RPTS with partial pivoting and without pivoting. The matrix ID refers to the matrices in Table 4.
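For illustration, the following sketch constructs such a 2x2 block test matrix from a scalar test matrix according to Equations (13)-(15), using dense matrices for brevity. All names are our own and the random number handling is only an assumption of this sketch.

```cpp
// Construction of a 2x2 block tridiagonal test matrix A_b = O_l * A * O_r
// from a scalar tridiagonal test matrix A_s (Equations (13)-(15)).
#include <vector>
#include <random>
#include <cmath>

using Mat = std::vector<std::vector<double>>;   // dense, row-major, square

// Eq. (13): spread A_s (size N/2) onto positions with column offset 0 or +-2;
// A_s[i/2][j/2] is the diagonal, sub- or super-diagonal entry of A_s.
static Mat expand(const Mat& As) {
    const int N = 2 * static_cast<int>(As.size());
    Mat A(N, std::vector<double>(N, 0.0));
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            if (i == j || i - 2 == j || i + 2 == j)
                A[i][j] = As[i / 2][j / 2];
    return A;
}

// Eq. (14)/(15): multiply with block diagonal orthogonal rotations, which
// leaves the singular values and hence the condition number unchanged.
static Mat block_rotate(const Mat& A, unsigned seed) {
    const int N = static_cast<int>(A.size());
    const double PI = 3.14159265358979323846;
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(0.0, PI);
    Mat B = A;
    for (int p = 0; p < N; p += 2) {            // left factor O_l: rotate row pairs
        const double a = dist(gen), c = std::cos(a), s = std::sin(a);
        for (int j = 0; j < N; ++j) {
            const double r0 = B[p][j], r1 = B[p + 1][j];
            B[p][j]     = c * r0 - s * r1;
            B[p + 1][j] = s * r0 + c * r1;
        }
    }
    for (int p = 0; p < N; p += 2) {            // right factor O_r: rotate column pairs
        const double a = dist(gen), c = std::cos(a), s = std::sin(a);
        for (int i = 0; i < N; ++i) {
            const double c0 = B[i][p], c1 = B[i][p + 1];
            B[i][p]     =  c * c0 + s * c1;
            B[i][p + 1] = -s * c0 + c * c1;
        }
    }
    return B;
}
```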
ID | LAPACK   | Eigen3   | RPTS partial pivot | RPTS no pivot | κ coarse (partial pivot) | κ coarse (no pivot)
1  | 7.41e-15 | 5.80e-15 | 1.35e-14 | 1.26e-14 | 7.20e+02 | 4.68e+03
2  | 1.57e-16 | 1.57e-16 | 1.64e-16 | 1.64e-16 | 1.00e+00 | 1.00e+00
3  | 1.96e-16 | 1.90e-16 | 1.95e-16 | 1.95e-16 | 2.63e+02 | 2.63e+02
4  | 7.81e-15 | 6.80e-15 | 1.37e-14 | 1.35e-14 | 7.20e+02 | 8.75e+03
5  | 1.84e-15 | 2.54e-15 | 2.54e-15 | 9.33e-15 | 1.46e+02 | 2.33e+02
6  | 1.63e-16 | 1.54e-16 | 1.71e-16 | 1.71e-16 | 1.03e+00 | 1.03e+00
7  | 2.66e-16 | 2.74e-16 | 2.65e-16 | 2.65e-16 | 3.00e+00 | 3.00e+00
8  | 1.69e-04 | 1.99e-04 | 1.47e-04 | 3.04e-04 | 7.41e+12 | 2.51e+11
9  | 1.87e-05 | 5.08e-05 | 1.84e-02 | 6.10e-02 | 1.77e+15 | 5.36e+19
10 | 9.37e-05 | 1.22e-04 | 9.97e-05 | 6.24e-05 | 1.00e+00 | 6.14e+01
11 | 2.66e-03 | 3.14e-03 | 2.37e-03 | 1.77e-03 | 1.66e+14 | 1.32e+14
12 | 1.51e+06 | 9.51e+05 | 7.08e+05 | 5.11e+05 | 9.69e+22 | 2.48e+23
13 | 4.67e+00 | 1.88e+00 | 3.11e+00 | 2.83e+00 | 6.16e+17 | 5.74e+17
14 | 5.89e-03 | 6.91e-03 | 3.36e-03 | 7.11e+00 | 4.96e+14 | 3.23e+18
15 | 1.56e+02 | 3.34e+02 | 3.71e+02 | NA       | 1.50e+20 | NA
16 | 1.77e-15 | 2.79e-15 | 2.57e-15 | 4.53e-09 | 4.13e+01 | 4.13e+01
17 | 1.68e-16 | 1.58e-16 | 1.74e-16 | 1.74e-16 | 1.00e+00 | 1.00e+00
18 | 1.78e-16 | 1.77e-16 | 1.89e-16 | 1.89e-16 | 1.73e+00 | 1.73e+00
19 | 1.85e-16 | 1.88e-16 | 1.81e-16 | 1.81e-16 | 1.03e+00 | 1.03e+00
20 | 1.77e-16 | 1.77e-16 | 1.68e-16 | 1.68e-16 | 1.64e+00 | 1.64e+00

Table 7. Left part: forward relative error of double-precision results for numerical stability analysis of 2x2 block tridiagonal solvers for matrices of size N = 1024. Right part: coarse system condition for RPTS with partial pivoting and without pivoting. Each matrix was created by transforming the scalar tridiagonal test matrices in [30] to 2x2 block tridiagonal form.

The results of RPTS are comparable to those of Eigen3 and LAPACK, with the exception of matrix 9 in Table 7 and matrices 12 and 17 in Table 6, which is caused by high coarse system condition numbers as a result of the static partitioning and blocking. In addition to the hard-to-solve synthetic problems, we analyze the forward relative error for real-world problems from the Sparse Matrix Collection in Table 5. Although the Sparse Matrix Collection contains many matrices, only few matrices have at most four sub- and super-bands on each side of the diagonal. In five cases, RPTS achieves a better error than LAPACK or Eigen3, but for matrices olm1000 and spmsrtls, LAPACK and Eigen3 are clearly better (see Section 3.1).

6.2 Optimizations for Small Tridiagonal Systems

Figure 4 displays the performance improvement of the fused final stage kernel (FFSK) and the dynamic choice of M in comparison to a final stage solver with a single CUDA thread and a static choice of M. As possible parameters for M we choose M = 17, 21, 31, 41. As expected, the speedup converges to one for large problem sizes because then the reduction and substitution step of RPTS of the initial tridiagonal system dominate the runtime. Moreover, the cost of the cyclic final stage is shown, which reduces the tridiagonal system through additional stages down to a 2x2 matrix.

Fig. 4. Single precision scalar tridiagonal solver performance for different RPTS final stage solvers. Left: the corresponding speedup of FFSK relative to a final stage solver with a single CUDA thread.
6.3 Multiple Right-Hand Sides

Figure 5 shows the performance of tridigpuSgtsv for a CUDA block dimension of 128, n_r = 32 right-hand sides, M = 31, and n̂_r = 4 in dependence on the system size N and in comparison to the tridiagonal solvers of cuSPARSE [34] (gtsv2 and gtsv2_nopivot), which also support multiple right-hand sides. The equation throughput in [MRows/s] is calculated by n_r · N · 10^{-6} / runtime, with the runtime in seconds. The optimal choice of n̂_r and M depends on the CUDA architecture and is mainly affected by the amount of available shared memory.

Fig. 5. Single precision scalar tridiagonal solver performance with n_r = 32 right-hand sides and N unknowns (Section A.4.1).

For the same configuration, Figure 7 shows the throughput to global memory for the first launched reduction kernel and the last launched substitution kernel of tridigpuSgtsv, in comparison to a simple copy kernel. The throughputs were calculated from the average time of 200 kernel executions and from the amount of accessed bytes according to Table 2. The reduction and substitution kernel may show higher throughputs than the copy kernel because they perform more load than store instructions, whereas the ratio between loads and stores is exactly one for the copy kernel. For large N, the calculation in the reduction and substitution kernel is completely hidden behind data movement as they run at the bandwidth limit like the copy kernel.

Fig. 7. Throughput of the reduction and substitution kernel of tridigpuSgtsv for the finest tridiagonal system (Section A.4.1) with n_r = 32 right-hand sides and N unknowns, in comparison to a copy kernel which copies N single precision elements. All kernels reach the bandwidth limit of the GPU.

Figure 6 shows that solving multiple right-hand sides within the same kernel is highly beneficial: tridigpuSgtsv for one right-hand side (n_r = 1) achieves approximately 12500 MRows/s, but for n_r = 32 it runs at around 31500 MRows/s for large N. The performance increase from n_r = 1, to n_r = 2, and to n_r = 4 is much larger than from n_r = 4 to n_r > 4 due to the caching of n̂_r = min(n_r, 4) right-hand sides in on-chip memory. The dotted lines show the further increase in the performance for sparse right-hand sides with ten non-zero entries and dense solution vectors (tridigpuSgtsv_csc2dense). As expected, the reduction in data movement speeds up the runtime of the kernels. If a CUDA block detects that all right-hand side values are zero within its part of the tridiagonal system, the calculation is skipped and zeros are written to the solution vector immediately.

Fig. 6. Single precision scalar tridiagonal solver performance of tridigpuSgtsv (Section A.4.1) and tridigpuSgtsv_csc2dense (Section A.4.4) with varying number of right-hand sides and N unknowns.
Fig. 8. Single precision block tridiagonal solver performance of tridigpu (tridigpuSbgtsv and tridigpuSbgtfs; block sizes 2, 3, 4; n_r = 1, 2, 4, 8, 16, 32) with varying number of right-hand sides and N unknowns. bgtsv (Section A.4.2) repeatedly solves the inner partitions, whereas bgtfs (Section A.4.2) uses a given factorization of the tridiagonal system for the solution.

6.4 Block Tridiagonal Systems and their Factorizations
Analogously to the scalar tridiagonal solver performance analysis for multiple right-hand sides in Section 6.3, Figure 8 shows the performance of the tridigpu block tridiagonal solvers. As expected, solving the system with the already factorized tridiagonal system achieves higher equation throughputs for increasing block sizes because the computationally expensive block inversions and determinant calculations have already been performed in the factorization. In this benchmark, solving the system with the previously calculated factorization is efficient for {n = 2, n_r ≥ 4}, {n = 3, n_r ≥ 4}, and {n = 4, n_r ≥ 1}. The performance of the factorization (gtf) in comparison to the solving step (gtfs) with n_r = 1 is shown in Figure 9. As expected, the runtime difference increases for increasing block sizes n.

Fig. 9. Single precision block tridiagonal factorization performance (tridigpuSgtf) in comparison to the solving step (tridigpuSgtfs, n_r = 1), which are declared in Section A.4.2; block sizes 2, 3, 4.

Fig. 10. Single precision banded solver performance for 5, 7, and 9 bands. For tridigpuSgbsv (Section A.4.3), the maximum equation throughputs are shown; compared solvers: cusparseSgpsvInterleavedBatch (tile sizes 8, 64, 512, N), the cuSPARSE ILU(0) solver, and LAPACK.

6.5 Banded Systems
For optimal performance results on the RTX 2080 Ti, we choose a CUDA block dimension of 128, {n = 2, M = 13, L = 32}, {n = 3, M = 7, L = 32}, and {n = 4, M = 7, L = 16}, where L denotes the number of partitions per CUDA block. For banded or block tridiagonal systems, M is rather small, due to the shared memory consumption of each block growing quadratically with the block size n. For n = 4, only half a warp is solving the inner partitions in the substitution kernel, whereas still two half warps are active in the reduction kernel, due to the parallel upwards and downwards oriented elimination. We compare the performance against LAPACK's gbsv running on the CPU, and against the interleaved batched pentadiagonal solver (SgpsvInterleavedBatch) as well as the ILU(0) solver from cuSPARSE. The ILU(0) solver for fully populated banded systems is identical to the LU factorization because no additional fill-in occurs. The performance shown for the ILU(0) does not include the time which is required for the calculation of the L and U factors. SgpsvInterleavedBatch is optimized for many small pentadiagonal systems, which fit into on-chip memory. The banded solve of tridigpu can also solve many small systems when batched together into one, but its hierarchical stages are actually designed to solve large pentadiagonal systems.
Hence, this comparison is in favor of SgpsvInterleavedBatch, but this is the only other GPU library with explicit support for at least certain pentadiagonal systems. Figure 10 shows the achieved equation throughput for different solvers, numbers of bands in the matrix, and tile sizes. The tile size only affects the performance of the cuSPARSE solvers, and not surprisingly, the performance drops below 3 MRows/s when a single system of size N is solved. For a tile size of N, one pentadiagonal system is solved, and for a tile size of N/64, 64 pentadiagonal systems are solved with one function call. The ILU(0) factorization, which can be used to solve pentadiagonal systems exactly, performs poorly if all bands are full with non-zero entries and thus the solving step is calculated with less parallelism. For pentadiagonal systems, we obtain the amount of bytes read by SgpsvInterleavedBatch for N = 2^20 from NVIDIA Nsight Compute as

tile size        512     64      8
bytes read [MB]  46.149  45.528  32.327

although the size of the 5 bands and the right-hand side is equal to 25.166 MB. The reduction and substitution kernel of tridigpu read 51 MB on the first stage, which is more than twice the size of the pentadiagonal system, due to reading the coarse solution in the substitution kernel. Nevertheless, the banded solve of tridigpu achieves the same performance as the interleaved batched solve from cuSPARSE with a tile size of 64 for large N. As expected, the equation throughput decreases if the number of bands is increased for tridigpu due to the increased amount of computational effort, in particular the block inversions with a complexity of O(n^3).

6.6 Calculation of A^-1 D with Sparse Right-Hand Sides and Solutions
We solve a scalar tridiagonal system AX = D with many right-hand sides D with the function gtsv_csc2csc. The results for gtsv_csc2dense with a sparse right-hand side D and a dense solution are shown in Section 6.3. For D, we take the transposed matrices of the Sparse Matrix Collection, which are listed in Table 8, and set the CSC column pointers of the result X equal to those of D. Note that the matrices from the Sparse Matrix Collection are only used as right-hand sides, and are not inverted. With different sizes and numbers of non-zero elements they provide a test set for analyzing the function's performance depending on different sparsity patterns of the right-hand sides. The coefficients of the tridiagonal matrix A are generated with a uniform distribution between minus one and one, and the number of right-hand sides, which are processed in one batch, is set to n_b = 128. The corresponding runtimes can be seen in Figure 11. More than half of the runtime is consumed by the selection of the largest absolute values of the dense tridiagonal solver result. For large inputs of single precision elements, the histogram calculating kernel (256 bins) of the k-selection algorithm runs with 320 GB/s on the RTX 2080 Ti, and is limited by the inter-warp aggregated irregular atomic additions in shared memory. In comparison to CUB's device histogram [33], that is a speedup of 1.57.
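To illustrate the selection criterion, the following host-side C sketch keeps, for one column of the dense result, the nnz_j entries with the largest absolute values. This is only a reference illustration of the pruning used by gtsv_csc2csc; it is not the histogram-based GPU k-selection [1] that the library actually employs.

#include <math.h>
#include <stdlib.h>

/* Reference pruning of one dense result column x (length N) to its
   nnz_j entries of largest absolute value (assumes nnz_j <= N).
   Illustrative host code only; tridigpu performs this selection on the GPU. */
static void prune_column(const float* x, int N, int nnz_j,
                         float* out_values, int* out_row_indices)
{
    char* taken = (char*)calloc((size_t)N, 1);
    for (int k = 0; k < nnz_j; ++k) {
        int best = -1;
        for (int i = 0; i < N; ++i)
            if (!taken[i] && (best < 0 || fabsf(x[i]) > fabsf(x[best])))
                best = i;
        taken[best] = 1;
        out_row_indices[k] = best;  /* row index of the k-th largest |x_i| */
        out_values[k] = x[best];
    }
    free(taken);
}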
Fig. 11. gtsv_csc2csc runtime in single precision for many sparse right-hand sides D from the Sparse Matrix Collection (Section A.4.5); the runtime is split into the gtsv_csc2dense and k_selection kernels for each matrix (ecology2, thermal2, Geo_1438, atmosmodl, Hook_1498, ML_Geer, af_shell10, G3_circuit, Transport).

Name        Problem             DOFs       nnz          max_element(nnz)
ecology2    2D/3D               999 999    4 995 991    5
thermal2    thermal             1 228 045  8 580 313    11
geo_1438    structural          1 437 960  63 156 690   57
atmosmodl   fluid dynamics      1 489 752  10 319 760   7
hook_1498   structural          1 498 023  60 917 445   93
ml_geer     structural          1 504 002  110 879 972  74
af_shell10  structural          1 508 065  52 672 325   35
g3_circuit  circuit simulation  1 585 478  7 660 826    6
transport   structural          1 602 111  23 500 731   15

Table 8. Matrices from the Sparse Matrix Collection [8] which are used as sparse right-hand sides D. max_element(nnz) is the maximum number of non-zeros per row.

The overall runtime is in the order of hundreds of seconds for the chosen matrices, which is caused by the runtime complexity of O(n_r N) for an N x n_r matrix D, as the hierarchical solve of one tridiagonal system is O(N). A faster alternative would be to drop coefficients early in D during the solve, similar to how ILU drops coefficients early in A during a factorization. However, this introduces a loss of precision and here we focus on the accuracy of the result. If A contains many independent small tridiagonal systems, other algorithms, which dynamically solve only the independent tridiagonal systems with non-zero right-hand sides, are more efficient.

7 SUMMARY AND OUTLOOK
In this section we discuss the limitations and future work for the presented function classes.

The scalar tridiagonal solve for multiple right-hand sides ([c]gtsv) is only limited by GPU memory bandwidth if the GPU has a sufficient amount of computational units for the corresponding precision. These functions run at the hardware limit and cannot be further optimized.

The block tridiagonal solve and factorization are limited by thread execution latencies. In comparison to the scalar solves, the block tridiagonal solves have an increased computational effort, which cannot be hidden behind global memory accesses any more if only one warp is solving the inner partitions. For the future, this motivates us to increase the amount of parallelism in the matrix-matrix or matrix-vector operations during the solving of the inner partitions. However, this is not just a change in the parameterization; a significantly different parallelization strategy would be required. This new parallelization should overcome the limitation of n ≤ 4 to some extent, maybe up to n ≤ 16. But if we wanted to process even larger n efficiently, then yet another parallel strategy would have to be invented.

The sparse tridiagonal solves are efficient for sparse right-hand side input vectors which fit into the GPU memory cache because each CUDA block reads all right-hand side elements. Approximately 60% of the runtime of gtsv_csc2csc is limited by the performance of the k-selection algorithm. Therefore, increasing the performance of the device histogram by stronger data privatization or merging the k-selection into the substitution kernel of the scalar tridiagonal solver will be investigated.

To ensure performance portability across different GPU architectures, an auto-tuning mechanism to determine the optimal kernel parameters like the block dimension or the partition size M would be useful in the future.

8 CONCLUSION
Our tridigpu CUDA library for block cyclic, tridiagonal, and banded systems covers multiple problem classes which often appear in practice.
We have discussed the C API, the underlying algorithms, the numerical stability, and the performance of these problem classes in detail. In comparison, we showed that our solvers outperform state-of-the-art tridiagonal solvers as they are implemented in cuSPARSE, and that our banded solver keeps up with batched pentadiagonal solvers, although it is able to solve large pentadiagonal systems and not only batched systems.

REFERENCES
[1] Tolu Alabi, Russel Steinbach, Jeffrey D. Blanchard, and Bradley Gordon. 2012. Fast k-Selection Algorithms for Graphics Processing Units. ACM Journal of Experimental Algorithmics 17, 4 (2012), 4.1-4.29. https://doi.org/10.1145/2133803.2345676
[2] Pablo Alfaro, Pablo Igounet, and Pablo Ezzatti. 2012. A study on the implementation of tridiagonal systems solvers using a GPU. In Proceedings - International Conference of the Chilean Computer Science Society, SCCC. IEEE, 219-227. https://doi.org/10.1109/SCCC.2011.29
[3] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. 1999. LAPACK Users' Guide (third ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA.
[4] Li-Wen Chang. 2014. Scalable parallel tridiagonal algorithms with diagonal pivoting and their optimization for many-core architectures. Ph.D. Dissertation.
[5] Li Wen Chang, John A. Stratton, Hee Seok Kim, and Wen Mei W. Hwu. 2012. A scalable, numerically stable, high-performance tridiagonal solver using GPUs. International Conference for High Performance Computing, Networking, Storage and Analysis, SC (2012). https://doi.org/10.1109/SC.2012.12
[6] Xiaojun Chen and Weijun Zhou. 2010. Smoothing nonlinear conjugate gradient method for image restoration using nonsmooth nonconvex minimization. SIAM Journal on Imaging Sciences 3, 4 (2010), 765-790. https://doi.org/10.1137/080740167
[7] Andrew Davidson, Yao Zhang, and John D. Owens. 2011. An auto-tuned method for solving large tridiagonal systems on the GPU. Proceedings - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011 (2011), 956-965. https://doi.org/10.1109/IPDPS.2011.92
[8] Timothy A. Davis and Yifan Hu. 2011. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Software 38, 1 (2011). https://doi.org/10.1145/2049662.2049663
[9] Adrian Perez Dieguez, Margarita Amor, and Ramon Doallo. 2016. New Tridiagonal Systems Solvers on GPU Architectures. Proceedings - 22nd IEEE International Conference on High Performance Computing, HiPC 2015 (2016), 85-93. https://doi.org/10.1109/HiPC.2015.17
[10] Adrián P. Diéguez, Margarita Amor, and Ramón Doallo. 2019. Parallel prefix operations on GPU: tridiagonal system solvers and scan operators. Journal of Supercomputing 75, 3 (2019), 1510-1523. https://doi.org/10.1007/s11227-018-2676-z
[11] Adrián P. Diéguez, Margarita Amor, and Ramón Doallo. 2019. Tree partitioning reduction: A new parallel partition method for solving tridiagonal systems. ACM Trans. Math. Software 45, 3 (2019). https://doi.org/10.1145/3328731
[12] Adrián Pérez Diéguez, Margarita Amor, Jacobo Lobeiras, and Ramón Doallo. 2018. Solving large problem sizes of index-digit algorithms on GPU: FFT and tridiagonal system solvers. IEEE Trans. Comput. 67, 1 (2018), 86-101. https://doi.org/10.1109/TC.2017.2723879
[13] J. W. Eastwood. 1982. Parallel computers: Architecture, programming and algorithms. Vol. 27.
Bristol. 104 pages. https://doi.org/10.1016/0010-4655(82)90016-9
[14] Jennifer B. Erway, Roummel F. Marcia, and Joseph A. Tyson. 2010. Generalized diagonal pivoting methods for tridiagonal systems without interchanges. IAENG International Journal of Applied Mathematics 40, 4 (2010), 1-7.
[15] S. Foulon and K. J. in 't Hout. 2010. ADI Finite Difference Schemes for Option Pricing. International Journal of Numerical Analysis and Modeling 7, 2 (2010), 303-320.
[16] Peter Frolkovič. 1990. Numerical recipes: The art of scientific computing. Acta Applicandae Mathematicae 19, 3 (jun 1990), 297-299. https://doi.org/10.1007/bf01321860
[17] Abhijit Ghosh and Chittaranjan Mishra. 2021. A parallel cyclic reduction algorithm for pentadiagonal systems with application to a convection-dominated Heston PDE. SIAM Journal on Scientific Computing 43, 2 (jan 2021), C177-C202. https://doi.org/10.1137/20M1311053
[18] Mike Giles, Endre Laszlo, Istvan Reguly, Jeremy Appleyard, and Julien Demouth. 2014. GPU Implementation of Finite Difference Solvers. Proceedings of WHPCF 2014: 7th Workshop on High Performance Computational Finance - Held in conjunction with SC 2014: The International Conference for High Performance Computing, Networking, Storage and Analysis (2014), 1-8. https://doi.org/10.1109/WHPCF.2014.10
[19] Andrew Gloster, Lennon Ó Náraigh, and Khang Ee Pang. 2019. cuPentBatch - A batched pentadiagonal solver for NVIDIA GPUs. Computer Physics Communications 241 (2019), 113-121. https://doi.org/10.1016/j.cpc.2019.03.016
[20] Dominik Göddeke and Robert Strzodka. 2011. Cyclic reduction tridiagonal solvers on GPUs applied to mixed-precision multigrid. IEEE Transactions on Parallel and Distributed Systems 23, 1 (2011), 22-32. https://doi.org/10.1109/TPDS.2010.61
[21] Gaël Guennebaud, Benoît Jacob, and Others. 2012. Eigen v3. http://eigen.tuxfamily.org
[22] George R. Halliwell. 2004. Evaluation of vertical coordinate and vertical mixing algorithms in the HYbrid-Coordinate Ocean Model (HYCOM). Ocean Modelling 7, 3-4 (2004), 285-322. https://doi.org/10.1016/j.ocemod.2003.10.002
[23] R. W. Hockney. 1965. A Fast Direct Solution of Poisson's Equation Using Fourier Analysis. Journal of the ACM (JACM) 12, 1 (1965), 95-113. https://doi.org/10.1145/321250.321259
[24] Rabia Kamra and S. Chandra Sekhara Rao. 2021. A stable parallel algorithm for block tridiagonal Toeplitz-block-Toeplitz linear systems. Mathematics and Computers in Simulation 190 (2021), 1415-1440. https://doi.org/10.1016/j.matcom.2021.07.019
[25] Michael Kass, Aaron Lefohn, and John Owens. 2006. Interactive Depth of Field Using Simulated Diffusion on a GPU. January 2006 (2006), 1-8.
[26] Michael Kass and Gavin Miller. 1990. Rapid, stable fluid dynamics for computer graphics. 24, 4 (1990), 49-57. https://doi.org/10.1145/97879.97884
[27] Hee Seok Kim, Shengzhao Wu, Li Wen Chang, and Wen Mei W. Hwu. 2011. A scalable tridiagonal solver for GPUs. Proceedings of the International Conference on Parallel Processing (2011), 444-453. https://doi.org/10.1109/ICPP.2011.41
[28] Ki Ha Kim, Ji Hoon Kang, Xiaomin Pan, and Jung Il Choi. 2021. PaScaL_TDMA: A library of parallel and scalable solvers for massive tridiagonal systems. Computer Physics Communications 260 (2021), 107722. https://doi.org/10.1016/j.cpc.2020.107722
[29] Christoph Klein and Robert Strzodka. [n.d.]. tridigpu. https://mp-force.ziti.uni-heidelberg.de/asc/code/tridigpu
[30] Christoph Klein and Robert Strzodka. 2021.
Tridiagonal GPU Solver with Scaled Partial Pivoting at Maximum Bandwidth. In ACM International Conference Proceeding Series. ACM, New York, NY, USA, 1-10. https://doi.org/10.1145/3472456.3472484
[31] Endre László, Mike Giles, and Jeremy Appleyard. 2016. Manycore algorithms for batch scalar and block tridiagonal solvers. ACM Trans. Math. Software 42, 4 (2016). https://doi.org/10.1145/2830568
[32] Jingmei Li, Zhigao Zheng, Qiao Tian, Guoyin Zhang, Fangyuan Zheng, and Yuanyuan Pan. 2017. Research on tridiagonal matrix solver design based on a combination of processors. Computers and Electrical Engineering 62 (2017), 1-16. https://doi.org/10.1016/j.compeleceng.2017.07.014
[33] Duane Merrill. 2021. CUB. https://nvlabs.github.io/cub
[34] NVIDIA. 2022. CUSPARSE library. NVIDIA Corporation, Santa Clara, California. https://docs.nvidia.com/cuda/cusparse
[35] Adrian Perez Dieguez, Margarita Amor Lopez, and Ramon Doallo Biempica. 2018. Solving Multiple Tridiagonal Systems on a Multi-GPU Platform. Proceedings - 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2018 (2018), 759-763. https://doi.org/10.1109/PDP2018.2018.00123
[36] Thomas H. Pulliam. 1986. Implicit solution methods in computational fluid dynamics. Applied Numerical Mathematics 2, 6 (1986), 441-474. https://doi.org/10.1016/0168-9274(86)90002-4
[37] L. Thomas. 1949. Elliptic problems in linear differential equations over a network. Watson Sci. Comput. Lab Report, Columbia University, New York 1 (1949), 42.
[38] I. E. Venetis, A. Kouris, A. Sobczyk, E. Gallopoulos, and A. H. Sameh. 2015. A direct tridiagonal solver based on Givens rotations for GPU architectures. Parallel Comput. 49 (2015), 101-116. https://doi.org/10.1016/j.parco.2015.03.008
[39] H. H. Wang. 1981. A Parallel Method for Tridiagonal Equations. ACM Transactions on Mathematical Software (TOMS) 7, 2 (1981), 170-183. https://doi.org/10.1145/355945.355947
[40] Xiaojing Wang and Z. G. Mou. 1991. A divide-and-conquer method of solving tridiagonal systems on hypercube massively parallel computers. Proceedings of the 3rd IEEE Symposium on Parallel and Distributed Processing 1991 (1991), 810-817. https://doi.org/10.1109/SPDP.1991.218237
[41] Technical Whitepaper. 2010. Semantic Versioning. 10 pages.

APPENDIX
A TRIDIAGONAL LIBRARY
tridigpu [29] is published under the 3-Clause BSD License, is intended to be backward compatible at the source level, and follows the semantic versioning guidelines introduced by Tom Preston-Werner [41].

A.1 Naming Conventions
We follow the general BLAS naming conventions and provide a C API tridigpu<t><format><operation>, where <t> can be S, D, C, Z, which represent the types float, double, cuComplex, and cuDoubleComplex, respectively. As a placeholder for the corresponding types we use <T>. We use [c] to indicate an optional letter c in the function name, which refers to the version of the function for cyclic systems.

A.2 Data Format and Layouts
A.2.1 Dense matrices. All dense matrices are saved in column-major order. Therefore, the multiple right-hand sides X are saved consecutively in memory. E.g., the right-hand sides of Equation 3 are saved as D = [1, 2, 3, 4, 5, 6, 7, 8, ..., 17, 18].
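As a small illustration of this column-major layout, the following C helper (not part of the tridigpu API) returns element d_{i,j} of a dense buffer with N rows:

/* Column-major access into a dense right-hand side or solution buffer
   with N rows: column j is stored contiguously, so element (i, j) sits
   at index i + j * N. For Equation 3 (N = 6), get_dense_element(D, 6, 0, 1)
   returns 7. Illustrative helper only. */
static float get_dense_element(const float* D, int N, int i, int j)
{
    return D[(long long)j * N + i];
}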
A.2.2 Tridiagonal matrices. The tridiagonal system is saved in a banded format, which consists of three separate buffers (a_i, b_i, c_i), each of length Nn. Thus, matrix A of Equation 2 is saved as
a = [ 1,  2,  3,  4,  5,  6]
b = [ 7,  8,  9, 10, 11, 12]
c = [13, 14, 15, 16, 17, 18],
whereas a[0] = 0 and c[5] = 0 for the tridiagonal system of Equation 3.

A.2.3 Block tridiagonal matrices. For block tridiagonal systems, each block is saved as a dense matrix in column-major order, and the tridiagonal bands of Equation 4 are saved as
a = [ 0,  0,  0,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12]
b = [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28]
c = [29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,  0,  0,  0,  0].

A.2.4 Banded matrices. We save banded matrices from the lowest to the highest band. Thus, the banded matrix from Equation 5 with kl = ku = 2 is saved as
AB = [ *,  *,  1,  4,  5,  8,  9, 12,
       *, 14,  3, 18,  7, 22, 11, 26,
      13, 16, 17, 20, 21, 24, 25, 28,
      15, 30, 19, 34, 23, 38, 27,  *,
      29, 32, 33, 36, 37, 40,  *,  *].
Analogously to the scalar case in Section A.2.2, the placeholders * contain coefficients in case of a cyclic system and are unused otherwise. The buffer has a length of N(kl + ku + 1), where kl, ku is the number of bands below and above the diagonal band, respectively. The diagonal is always included in this format, and the distances of the bands to the diagonal are fixed, e.g., for the lower bands: -kl, -(kl - 1), ..., -1.

A.2.5 Sparse matrices. Sparse matrices are saved in a compressed sparse column format (CSC) with zero-based indexing. The CSC matrix D of Equation 6 is saved as
values      = [1, 5, 10, 11, 2, 12, 3, 8, 13, 6, 9, 4, 7]
row_indices = [0, 1,  5,  6, 0,  6, 0, 3,  6, 1, 3, 0, 1]
col_ptrs    = [0, 4, 6, 9, 11, 12, 13].

A.3 Resource Consumption
RPTS is a hierarchical algorithm, which must save intermediate results during the processing of higher stages. The amount of required temporary memory depends on various parameters like the data type, the partition size, the number of stages in the hierarchy, the size of the blocks, and the number of right-hand sides. When the original tridiagonal equation system size is calculated with N(n + 3), a scalar tridiagonal solve in single precision with tridigpuSgtsv requires approximately between 5% and 11% as temporary memory, whereas cusparseSgtsv2 [34] requires between 100% and 110% of the tridiagonal system byte size. Moreover, tridigpu does not require any global initialization or finalization functions.

A.4 C API
In this section the function declarations of the C API of tridigpu are presented. Most functions take a parameter to control the pivoting scheme, which is an enum of type tridigpu_pivoting_t with the instances TRIDIGPU_NO_PIVOTING, TRIDIGPU_PARTIAL_PIVOTING, TRIDIGPU_SCALED_PARTIAL_PIVOTING, and TRIDIGPU_DEFAULT_PIVOTING. Many functions require a (temporary) buffer of GPU memory for execution, the size of which is returned by the corresponding tridigpu<t><format><operation>_bufferSizeExt function.

A.4.1 Solve (cyclic) scalar tridiagonal systems with multiple right-hand sides. This function solves Problem Class Cyclic and Problem Class Scalar, which are (cyclic) general tridiagonal systems.
tridigpu_status_t tridigpu<t>[c]gtsv_bufferSizeExt(tridigpu_pivoting_t p, int N, int n_r,
                                                   size_t* buffer_size_in_bytes)
tridigpu_status_t tridigpu<t>[c]gtsv(cudaStream_t stream, tridigpu_pivoting_t p, int N, int n_r,
                                     const <T>* a, const <T>* b, const <T>* c,
                                     <T>* X, const <T>* D, void* p_buffer)

The function does not allocate any extra storage, enqueues the calculation into the CUDA stream, and is non-blocking, i.e., it may return control back to the host before the result is ready. For the solution of cyclic systems, cgtsv is provided.

Input       Description
p           recommended value: TRIDIGPU_DEFAULT_PIVOTING (see above)
N           size of the tridiagonal system
n_r         number of right-hand sides
a           lower band of tridiagonal matrix (length N)
b           diagonal of tridiagonal matrix (length N)
c           upper diagonal of tridiagonal matrix (length N)
D           dense right-hand side matrix and buffer of length N * n_r
p_buffer    temporarily required memory buffer on the GPU

Output      Description
X           dense solution matrix and buffer of length N * n_r. Buffer D may also be passed as X, then the solution replaces the right-hand side data after kernel execution.

Example:
size_t buffer_size_in_bytes = 0;
if (tridigpuSgtsv_bufferSizeExt(p, N, n_r, &buffer_size_in_bytes)) {
    printf("determining temporary buffer size failed\n");
}
void* p_buffer;
if (cudaMalloc(&p_buffer, buffer_size_in_bytes)) {
    printf("allocating temporary buffer failed\n");
}
if (tridigpuSgtsv(stream, p, N, n_r, a, b, c, X, D, p_buffer)) {
    printf("solving tridiagonal system failed\n");
}

A.4.2 Solve block tridiagonal systems with multiple right-hand sides. The functions in this section solve Problem Class Block.

tridigpu_status_t tridigpu<t>[c]bgtsv_bufferSizeExt(tridigpu_pivoting_t p, int N, int n_r, int n,
                                                    size_t* buffer_size_in_bytes)
tridigpu_status_t tridigpu<t>[c]bgtsv(cudaStream_t stream, tridigpu_pivoting_t p, int N, int n_r, int n,
                                      const <T>* a, const <T>* b, const <T>* c,
                                      <T>* X, const <T>* D, void* p_buffer)

The function does not allocate any extra storage, enqueues the calculation into the CUDA stream, and is non-blocking, i.e., it may return control back to the host before the result is ready. The function is available for block (bgtsv) and cyclic block general tridiagonal systems (cbgtsv).

Input       Description
p           recommended value: TRIDIGPU_DEFAULT_PIVOTING (see above)
N           size of the block tridiagonal system
n_r         number of right-hand sides
n           size of one block; allowed values are 1, 2, 3, 4
a           lower band of tridiagonal matrix (length N * n * n)
b           diagonal of tridiagonal matrix (length N * n * n)
c           upper diagonal of tridiagonal matrix (length N * n * n)
D           dense right-hand side matrix and buffer of length N * n_r * n
p_buffer    temporarily required memory buffer on the GPU

Output      Description
X           dense solution matrix and buffer of length N * n_r * n. Buffer D may also be passed as X, then the solution replaces the right-hand side data after kernel execution.

Example:
size_t buffer_size_in_bytes = 0;
if (tridigpuSbgtsv_bufferSizeExt(p, N, n_r, n, &buffer_size_in_bytes)) {
    printf("determining temporary buffer size failed\n");
}
void* p_buffer;
if (cudaMalloc(&p_buffer, buffer_size_in_bytes)) {
    printf("allocating temporary buffer failed\n");
}
if (tridigpuSbgtsv(stream, p, N, n_r, n, a, b, c, X, D, p_buffer)) {
    printf("solving block tridiagonal system failed\n");
}
Function bgtsv factorizes the matrix with every function call, whereas bgtf calculates the factorization explicitly once, and bgtfs uses it to compute the solution from a given right-hand side. The explicit factorization is more efficient if the kernels of tridigpu are not memory bound any more.

tridigpu_status_t tridigpu<t>bgtf_factorizationBufferSizeExt(tridigpu_pivoting_t p, int N, int n,
                                                             size_t* factorization_buffer_size_in_bytes)
tridigpu_status_t tridigpu<t>bgtf(cudaStream_t stream, tridigpu_pivoting_t p, int N, int n,
                                  <T>* a, <T>* b, <T>* c, void* p_factorization_buffer)

The function does not allocate any extra storage, enqueues the calculation into the CUDA stream, and is non-blocking, i.e., it may return control back to the host before the result is ready.

Input       Description
p           recommended value: TRIDIGPU_DEFAULT_PIVOTING (see above)
N           size of the block tridiagonal system
n           size of one block; allowed values are 1, 2, 3, 4
a           lower band of tridiagonal matrix (length N * n * n)
b           diagonal of tridiagonal matrix (length N * n * n)
c           upper diagonal of tridiagonal matrix (length N * n * n)

Output                    Description
a                         lower band of factorized tridiagonal matrix (length N * n * n)
b                         diagonal of factorized tridiagonal matrix (length N * n * n)
c                         upper diagonal of factorized tridiagonal matrix (length N * n * n)
p_factorization_buffer    buffer to save additional data required to factorize the matrix

The tridiagonal system is solved for a given factorization and a corresponding right-hand side with bgtfs.

tridigpu_status_t tridigpu<t>bgtfs_bufferSizeExt(tridigpu_pivoting_t p, int N, int n_r, int n,
                                                 size_t* buffer_size_in_bytes)
tridigpu_status_t tridigpu<t>bgtfs(cudaStream_t stream, tridigpu_pivoting_t p, int N, int n_r, int n,
                                   const <T>* a, const <T>* b, const <T>* c,
                                   <T>* X, const <T>* D,
                                   const void* p_factorization_buffer, void* p_buffer)

The function does not allocate any extra storage, enqueues the calculation into the CUDA stream, and is non-blocking, i.e., it may return control back to the host before the result is ready. Analogously, LAPACK provides gttrf and gttrs.

Input                     Description
p                         recommended value: TRIDIGPU_DEFAULT_PIVOTING (see above)
N                         size of the block tridiagonal system
n_r                       number of right-hand sides
n                         size of one block; allowed values are 1, 2, 3, 4
a                         lower band of factorized tridiagonal matrix (length N * n * n)
b                         diagonal of factorized tridiagonal matrix (length N * n * n)
c                         upper diagonal of factorized tridiagonal matrix (length N * n * n)
D                         dense right-hand side matrix and buffer of length N * n_r * n
p_factorization_buffer    buffer with additional data of the factorized matrix
p_buffer                  temporarily required memory buffer on the GPU

Output      Description
X           dense solution matrix and buffer of length N * n_r * n. Buffer D may also be passed as X, then the solution replaces the right-hand side data after kernel execution.
Example:
size_t factorization_buffer_size_in_bytes = 0;
if (tridigpuSbgtf_factorizationBufferSizeExt(p, N, n, &factorization_buffer_size_in_bytes)) {
    printf("determining factorization buffer size failed\n");
}
void* p_factorization_buffer;
if (cudaMalloc(&p_factorization_buffer, factorization_buffer_size_in_bytes)) {
    printf("allocating factorization buffer failed\n");
}
if (tridigpuSbgtf(stream, p, N, n, a, b, c, p_factorization_buffer)) {
    printf("factorizing block tridiagonal system failed\n");
}
size_t buffer_size_in_bytes = 0;
if (tridigpuSbgtfs_bufferSizeExt(p, N, n_r, n, &buffer_size_in_bytes)) {
    printf("determining temporary buffer size failed\n");
}
void* p_buffer;
if (cudaMalloc(&p_buffer, buffer_size_in_bytes)) {
    printf("allocating temporary buffer failed\n");
}
while (...) {
    // - set right-hand sides `D`
    // - `p_buffer` does not need to be initialized before the call to `tridigpuSbgtfs`
    // - `p_factorization_buffer`, `a`, `b`, `c` are written by `tridigpuSbgtf` and read by `tridigpuSbgtfs`
    if (tridigpuSbgtfs(stream, p, N, n_r, n, a, b, c, X, D, p_factorization_buffer, p_buffer)) {
        printf("solving block tridiagonal system with factorization failed\n");
    }
}

A.4.3 Solve banded systems with multiple right-hand sides. This function solves Problem Class DIA.

tridigpu_status_t tridigpu<t>gbsv_bufferSizeExt(tridigpu_pivoting_t p, int N, int n_r, int kl, int ku,
                                                size_t* buffer_size_in_bytes)
tridigpu_status_t tridigpu<t>gbsv(cudaStream_t stream, tridigpu_pivoting_t p, int N, int n_r, int kl, int ku,
                                  const <T>* AB, <T>* X, const <T>* D, void* p_buffer)

The function does not allocate any extra storage, enqueues the calculation into the CUDA stream, and is non-blocking, i.e., it may return control back to the host before the result is ready. The numeric behaviour of this function is equal to tridigpu<t>bgtsv and the banded system is converted on-chip into a block tridiagonal system.

Input       Description
p           recommended value: TRIDIGPU_DEFAULT_PIVOTING (see above)
N           size of the banded system
n_r         number of right-hand sides
kl          number of lower bands in AB; kl = 1, 2, 3, 4
ku          number of upper bands in AB; ku = 1, 2, 3, 4
AB          banded storage format of matrix and buffer of length N * (kl + ku + 1)
D           dense right-hand side matrix and buffer of length N * n_r
p_buffer    temporarily required memory buffer on the GPU

Output      Description
X           dense solution matrix and buffer of length N * n_r. Buffer D may also be passed as X, then the solution replaces the right-hand side data after kernel execution.

Example:
size_t buffer_size_in_bytes = 0;
if (tridigpuSgbsv_bufferSizeExt(p, N, n_r, kl, ku, &buffer_size_in_bytes)) {
    printf("determining temporary buffer size failed\n");
}
void* p_buffer;
if (cudaMalloc(&p_buffer, buffer_size_in_bytes)) {
    printf("allocating temporary buffer failed\n");
}
if (tridigpuSgbsv(stream, p, N, n_r, kl, ku, AB, X, D, p_buffer)) {
    printf("solving banded system failed\n");
}

A.4.4 Solve scalar tridiagonal systems with multiple sparse right-hand sides and dense solutions. This function solves Problem Class ScalarCSC2Dense for D saved in a CSC format.
tridigpu_status_t tridigpu<t>gtsv_csc2dense_bufferSizeExt(tridigpu_pivoting_t p, int N, int n_r,
                                                          size_t* buffer_size_in_bytes)
tridigpu_status_t tridigpu<t>gtsv_csc2dense(cudaStream_t stream, tridigpu_pivoting_t p, int N, int n_r,
                                            const <T>* a, const <T>* b, const <T>* c, <T>* X,
                                            const <T>* rhs_values, const int* rhs_row_indices,
                                            const int* rhs_col_ptrs, int rhs_nnz, void* p_buffer)

The function does not allocate any extra storage, enqueues the calculation into the CUDA stream, and is non-blocking, i.e., it may return control back to the host before the result is ready.

Input              Description
p                  recommended value: TRIDIGPU_DEFAULT_PIVOTING (see above)
N                  size of the tridiagonal system
n_r                number of right-hand sides
a                  lower band of tridiagonal matrix (length N)
b                  diagonal of tridiagonal matrix (length N)
c                  upper diagonal of tridiagonal matrix (length N)
rhs_values         value array of sparse right-hand sides in CSC format (length rhs_nnz)
rhs_row_indices    row index array of sparse right-hand sides in CSC format (length rhs_nnz)
rhs_col_ptrs       column pointer array of sparse right-hand sides in CSC format (length n_r + 1)
rhs_nnz            number of non-zero entries of sparse right-hand side matrix D
p_buffer           temporarily required memory buffer on the GPU

Output      Description
X           dense solution matrix and buffer of length N * n_r

Example:
size_t buffer_size_in_bytes = 0;
if (tridigpuSgtsv_csc2dense_bufferSizeExt(p, N, n_r, &buffer_size_in_bytes)) {
    printf("determining temporary buffer size failed\n");
}
void* p_buffer;
if (cudaMalloc(&p_buffer, buffer_size_in_bytes)) {
    printf("allocating temporary buffer failed\n");
}
if (tridigpuSgtsv_csc2dense(stream, p, N, n_r, a, b, c, X, rhs_values, rhs_row_indices,
                            rhs_col_ptrs, rhs_nnz, p_buffer)) {
    printf("solving tridiagonal system with sparse rhs failed\n");
}

A.4.5 General tridiagonal solve for many sparse right-hand sides and sparse solutions. This function solves Problem Class ScalarCSC2CSC, i.e., it calculates A^-1 D for a general tridiagonal matrix A and a sparse matrix D. For large sparse matrices D the dense solution matrix would exceed the available memory, therefore, this function outputs only a sparse result matrix X. In fact, X = prune(A^-1 D, S_p), thus the result is equal to the dense result pruned to a sparsity pattern S_p. S_p is determined such that the maximum absolute values in the dense result are included in the sparse result. If nnz_j is the number of non-zero entries in column j of matrix X, and for each j let σ_j : [0, N - 1] → [0, N - 1] be an index permutation such that the j-th column of the dense result X̄ := A^-1 D is sorted in descending order with respect to the absolute values, i.e. |x̄_{σ_j(i),j}| is non-increasing in i, then S_p is obtained by

    (S_p)_{i,j} = 1 if σ_j(i) < nnz_j, and 0 otherwise.    (16)

In words: the sparsity pattern of X is controlled by the user-given column pointers (the nnz_j), and each column of the sparse result is filled with the maximum values of the dense result with respect to their absolute value.
tridigpu_status_t tridigpu<t>gtsv_csc2csc(tridigpu_pivoting_t p, int N,
                                          const <T>* a, const <T>* b, const <T>* c,
                                          const <T>* rhs_values, const int* rhs_row_indices,
                                          const int* rhs_col_ptrs, int n_r, int rhs_nnz,
                                          <T>* result_values, int* result_row_indices,
                                          const int* result_col_ptrs, int result_nnz)

The function allocates extra GPU storage, enqueues the calculation into the default CUDA stream, and is blocking the host, i.e., it returns control back to the host when the result is ready.

Input               Description
p                   recommended value: TRIDIGPU_DEFAULT_PIVOTING (see above)
N                   size of the tridiagonal system
a                   lower band of tridiagonal matrix (length N)
b                   diagonal of tridiagonal matrix (length N)
c                   upper diagonal of tridiagonal matrix (length N)
rhs_values          CSC values of right-hand sides D (length rhs_nnz)
rhs_row_indices     CSC row indices of right-hand sides D (length rhs_nnz)
rhs_col_ptrs        CSC column pointers of right-hand sides D (length N + 1)
n_r                 number of right-hand sides
rhs_nnz             number of non-zero entries of right-hand sides D
result_col_ptrs     CSC column pointers of the result matrix (length N + 1)
result_nnz          number of non-zero entries of the result matrix

Output                 Description
result_values          CSC values of sparse solution matrix X (length result_nnz)
result_row_indices     CSC row indices of sparse solution matrix X (length result_nnz)
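Example (a minimal call sketch, analogous to the examples in the previous sections; the CSC buffers of D and the result, including result_col_ptrs and result_nnz, are assumed to be prepared by the caller, and in contrast to the other functions no temporary buffer is passed):

if (tridigpuSgtsv_csc2csc(p, N, a, b, c,
                          rhs_values, rhs_row_indices, rhs_col_ptrs,
                          n_r, rhs_nnz,
                          result_values, result_row_indices,
                          result_col_ptrs, result_nnz)) {
    printf("solving tridiagonal systems with sparse right-hand sides and sparse result failed\n");
}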

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for proit or commercial advantage and that copies bear this notice and the full citation on the irst page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). © 2023 Copyright held by the owner/author(s). 2329-4949/2023/1-ART111 https://doi.org/10.1145/3580373 ACM Trans. Parallel Comput. 111:2 • Christoph Klein and Robert Strzodka Problem Class Cyclic cyclic tridiagonal systems with multiple right-hand a ,c sides , 0), ( 0 ˆ N −1 Problem Class Scalar scalar tridiagonal systems with multiple right-hand n =sides 1), ( Problem Class Block block tridiagonal systems with multiple right-hand n >sides 1), ( Problem Class DIA banded systems with multiple right-hand sides n > 1),( Problem Class ScalarCSC2Dense scalar tridiagonal systems with multiple sparse right-hand AX sides, = D, with a compressed sparse column encoded matrix D (see CSC format in cuSPARSE [34]), Problem Class ScalarCSC2CSC scalar tridiagonal systems with many sparse right-hand sides and a pruned incomplete result, AX = D, with CSC encoded large sparse matrices D and X. In the following we list one small example for each problem class. Obviously, one example cannot express all possible parameter variations within the corresponding problem class, but only highlight a certain characteristic aspect. Example of Problem Class Cyclic is a system with a , c , 0, n = 1, N = 6, and n = 1 0 ˆ r N −1 7 13 1 x 1 0,0 * + * + * + 2 8 14 x 2 . / . 1,0/ . / . / . / . / . / . / . / 3 9 15 x 3 2,0 . / . / . / AX = = = D. (2) . / . / . / 4 10 16 x 4 . / . 3,0/ . / . / . / . / . / . / . / 5 11 17 x 5 4,0 18 6 12 x 6 , - , 5,0- , - Example of Problem Class Scalar is a system with n = 3, N = 6, and n = 1 6 12 x x x 1 7 13 0,0 0,1 0,2 * + * + * + 1 7 13 x x x 2 8 14 . / . / . / 1,0 1,1 1,2 . / . / . / . / . / . / 2 8 14 x x x 3 9 15 2,0 2,1 2,2 . / . / . / AX = = = D. (3) . / . / . / 3 9 15 x x x 4 10 16 . / . 3,0 3,1 3,2/ . / . / . / . / . / . / . / 4 10 16 x x x 5 11 17 4,0 4,1 4,2 5 11 x x x 6 12 18 , - , - , - 5,0 5,1 5,2 Example of Problem Class Block is a system with n = 1, N = 8, N = 4, and n = 2 13 15 29 31 * + * + . / 14 16 30 32 . / . / . / . / . / 1 3 17 19 33 35 x 3 . / 0,0 . / . / * + . / . / . / 2 4 18 20 34 36 x 4 1,0 . / . / . / AX = = . / = D. (4) . / . / . / 5 7 21 23 37 39 x 5 . / . 2,0/ . / . / . / . / 6 8 22 24 38 40 x 6 3,0 . / . / , - . / . / .7/ 9 11 25 27 . / 10 12 26 28 , - , - ACM Trans. Parallel Comput. tridigpu: A GPU library for block tridiagonal and banded linear equation systems • 111:3 Example of Problem Class DIA is a pentadiagonal system with n = 1, N = 8, N = 4, and n = 2 13 15 29 1 * + * + . / 14 16 30 32 2 . / . / . / . / . / 1 3 17 19 33 x 3 . / 0,0 . / . / * + . / . / . / 4 18 20 34 36 x 4 . / 1,0 . / . / AX = = . / = D. (5) . / . / . / 5 7 21 23 37 x 5 . / . / 2,0 . / . / . / . / 8 22 24 38 40 x 6 3,0 . / , - . / . / . / . / . 9 11 25 27 / 12 26 28 8 , - , - Example of Problem Class ScalarCSC2Dense is a system with n = 6, N = 6, N = 6, n = 1, sparse matrix 6×6 D ∈ R and a dense solution 6 12 x x x x x x 1 2 3 4 0,0 0,1 0,2 0,3 0,4 0,5 * + * + * + . / . / . / 1 7 13 x x x x x x 5 6 7 1,0 1,1 1,2 1,3 1,4 1,5 . / . / . / . / . / . / 2 8 14 x x x x x x . / . 2,0 2,1 2,2 2,3 2,4 2,5/ . / AX = . / . / = . / = D. (6) . / . / . / 3 9 15 x x x x x x 8 9 3,0 3,1 3,2 3,3 3,4 3,5 . / . / . 
/ . / . / . / . 4 10 16/ .x x x x x x / .10 / 4,0 4,1 4,2 4,3 4,4 4,5 5 11 x x x x x x 11 12 13 5,0 5,1 5,2 5,3 5,4 5,5 , - , - , - −1 The sparse matrixD contains the right-hand sides and we calculate X = A D. For largeA, this operation is only possible Dif has few columns, otherwise the dense X becomes too big to store. Example of Problem Class ScalarCSC2CSC is a system with n = 6, N = 6, N = 6, n = 1, and sparse 6×6 matricesD, X ∈ R , which is inverted exactly, but the output is written in a sparse form by omitting small values 6 12 x x 1 2 3 4 0 6 * + * + * + . / . / . / 1 7 13 x 5 6 7 . / . / . / . / . / . / 2 8 14 x x x x . / . 1 4 8 11 / . / AX = . / . / = . / = D. (7) . / . / . / 3 9 15 x x x 8 9 2 9 12 . / . / . / . / . / . / 4 10 16 10 . / . / . / 5 11 x x x 11 12 13 3 5 10 , - , - , - −1 The calculated sparse matrix X := prune(X , S ) is the dense result X := A D pruned to the sparsity pattern S (all coeicients outside S are discarded). This operation is possible even for A and large D with p p many columns because the stored solution X is sparse. The sparsity pattern S is chosen such that the largest absolute values in each column of the dense result X are included in the sparse result X. The rest of the paper is organized as follows: Section 2 discusses related work and Section 3 gives an overview of the contributions of this paper. Parallel algorithms are presented in Section 4 and their implementation details in Section 5. We conclude the paper with the numerical and performance results in Section 6. Appendix A explains the API oftridigpu from a user perspective. 2 RELATED WORK On sequential architectures the Thomas Algorithm 37] is [ the classical choice to solve tridiagonal systems. Due to the increasing amount of parallel computer architectures, parallel algorithms like the Cyclic 23]Re (CR), duction [ the Parallel Cyclic Reduction 13] (PCR), [ the Partition Metho39 d [], and the Divide and Conquer 40[] approach were later developed. The Cyclic Reduction and the Parallel Cyclic Reduction were implemented by many2,authors 7, 20, 25[, 27] for a GPU. With the diagonal pivoting from Erway14et],al. the[irst numerically stable tridiagonal solver for the ACM Trans. Parallel Comput. 111:4 • Christoph Klein and Robert Strzodka Symbol Meaning A block tridiagonal matrix or banded matrix of N ×size N N number of unknowns in the linear equation system n size of one coeicient blonck,= 1, 2, 3, 4 ˆ ˆ N number of block rows in the linear equation system N = N( n) a lower band of coeicient blocks in the block tridiagonal matrix b diagonal of coeicient blocks in the block tridiagonal matrix c upper band of coeicient blocks in the block tridiagonal matrix k number of lower bands in a banded matrix, k = 0, 1, 2, 3, 4 l l k number of upper bands in a banded matrix, k = 0, 1, 2, 3, 4 u u n number of right-hand sides of the linear equation system D dense right-hand sides; matrix ofN size × n D sparse right-hand sides encoded as a CSC matrix ofN size × n X dense solution vectors; matrix of size N × n X sparse solution vectors encoded as a CSC matrix of size N × n S sparsity pattern matrix of size N × n p r M number of block rows per partition of the tridiagonal system L number of partitions of size M, which are processed by one CUDA block nˆ number of right-hand sides, which are kept simultaneously in on-chip memory n number of right-hand sides per batch of function tridigpu<t>gtsv_csc2csc Table 1. Symbols and their meanings. 
The upper part refers to the interface tridigpu of , whereas the lower part refers to internal library parameters. GPU was implemented by Chang et al. [4, 5] and improved in aspects of numeric stability by Venetis et al. [38], who pointed out the problematic behaviour of diagonal pivoting and singular submatrices and therefore proposed g-spike. Two tridiagonal solvers were proposed by Diéguez et9ś al. 12,[35]: The Tree Partitioning Reduction (TPR) and a Wang and Mou implementation with their Butterly Processing Library for GPUs (WM BPLG). The tridiagonal partitioning scheme deployed in this paper is also used by18Giles ] and Lászlo et al. [et al.31 [ ] to solve many independent small tridiagonal systems with a size of up to 1000 unknowns. The same partitioning scheme is also used by Klein and Strzodka 30][to solve large non-cyclic scalar tridiagonal systems with a single right-hand side, and Kim et 28 al.],[who proposed PaScaL_TDMA, which is a tridiagonal solver without pivoting for distributed memory machines, which communicate with MPI, to solve very large tridiagonal systems. Batched block tridiagonal solv31 ers] or [ batched scalar pentadiagonal solvers 17,[19] (PCR-Penta and cuPent- Batch) were implemented on GPUs, but only Kamra and Rao 24[] solved large block tridiagonal systems, who developed a parallel algorithm (DPBWZA) for block tridiagonal toeplitz-block-toeplitz linear systems. Ghosh and Mishra17 [ ] proposed a Parallel Cyclic Reduction algorithm (PCR-Penta) to solve batched pentadiagonal systems. The above mentioned papers only treat special cases (batched, Toeplitz, pentadiagonal), whertridigpu eas our library solves a much more general problem (Eq. 1) including large cyclic block tridiagonal systems with multiple right-hand sides and support for (scaled) partial pivoting. Mor tridigpu eover, solves general block tridiagonal systems without requiring symmetry and provides a comprehensive C API for multiple value types, as well as solving and factorizing operations. ACM Trans. Parallel Comput. tridigpu: A GPU library for block tridiagonal and banded linear equation systems • 111:5 3 CONTRIBUTIONS This paper is a thorough extension of the tridiagonal solver proposed by Klein and30 Strzo ] into dkaa[ multi- functional library and makes the following new contributions: • support of diferent types of matrix coeicients and vector elements, • solution of cyclic tridiagonal systems a , 0, c( , 0), 0 ˆ N −1 • multiple right-hand sides n >(1), • performance optimizations for small scalar systems N < 10( ), • block tridiagonal systems and their factorization n > 1, n ( ≥ 1), • banded systems with up to four sub- and super-bands on each side of the diagonal, −1 • calculation A of D for sparse matricesD. 3.1 Limitations Internallytridigpu uses the Recursive Partitioned Schur Complement Algorithm (RPTS) 30], in [ order to expose suicient parallelism and solve large systems hierarchically. RPTS is based on the static partition methods, the accuracy of which are limited by the conditioning of the coarse system, which is not controlled explicitly. Therefore, ill-conditioned coarse systems can negatively afect the quality of the calculated solution, although in practice this is a seldom occurrence. In case of scalar values, (scaled) partial pivoting works as expected by comparing the absolute values and implicitly permuting the rows. 
In case of small coeicient a ,bblo ,c cks ), the ( determinants are compared and the i i i block rows are implicitly permuted, but no row permutations within a block row occur. This imposes assumptions on the determinants of certain blocks and thus omits some opportunities for better numerical stability in favor of a uniied code basis. 3.2 C API Overview tridigpu follows the BLAS naming convention tridigpu<t><format><operation>, where <t> can be S, D, C, Z, which represent the typesfloat, double, cuComplex, and cuDoubleComplex, respectively. As a placeholder for the corresponding types we<Tuse > and [c] denotes an optional character ’c’ in a function name <format . > is either [c]gt ’ ’ for (cyclic) general tridiagonal systems, [c]bgt’ ’ for (cyclic) block general tridiagonal systems,gbor’ for ’ general banded systems.<operation> is either sv’’ for a solving operation,f’’ for a factorizing operation, fsor ’ for ’ a solving operation with a given factorization. tridigpu provides the following functions: • tridigpu<t>[c]gtsv: This function solves Problem Class Cyclic and Problem Class Scalar, which are scalar (cyclic) generaltridiagonal systems with multiple right-hand sides. • tridigpu<t>[c]bgtsv: This function solves Problem Class Block, which cyclic) are (blockgeneraltridiagonal systems with multiple right-hand sides. • tridigpu<t>bgtf, tridigpu<t>bgtfs: These functions solve Problem Class Block, wher bgtf e factorizes the matrix explicitlybgtfs and solves the system for a speciic right-hand side and a given factorization. • tridigpu<t>gbsv: This function solves Problem Class DIA, which general are banded systems with multiple right-hand sides. • tridigpu<t>gtsv_csc2dense: This function solves Problem Class ScalarCSC2Dense, which are scalar tridiagonal systems with multiple sparse right-hand sides and dense solutions. • tridigpu<t>gtsv_csc2csc: This function solves Problem Class ScalarCSC2CSC, which are scalar tridiag- onal systems with many sparse right-hand sides and sparse solutions. ACM Trans. Parallel Comput. 111:6 • Christoph Klein and Robert Strzodka II III IV VI V Fig. 1. Matrix graph representation and matrix paterns during diferent phases of RPTS assuming no row permutations took place. The partition sizeM is equal to 7 and the system size is N = 21. A detailed explanation of the C API is located in the Appendix A. The API design follows the conventions from cuSPARSE [34] and LAPACK [3], e.g., only the factorizing function bgtf overwrites the input tridiagonal system. Except tridigpu<t>gtsv_csc2csc, all functions do not allocate any extra storage, enqueue the calculation into a CUDA stream, and are non-blocking, i.e., they may return control back to the host before the result is ready. 4 ALGORITHM The library presented in this paper is based on Rethe cursivePartitioneTdridiagonal Schur Complement algorithm with (scaled) partial pivoting. RPTS was presented by Klein and Strzo 30],dka to solv [ e non-cyclic scalar tridiagonal systems with a single right-hand side, i.e. one special n =case 1) of ( Problem Class Scalar. All presented problem classes in this paper can only be solved with the newly developed algorithms and implementations. Nevertheless, the mathematical formulation fr30 om ] is [ the basis to solve the problem classes presented here. 
Therefore, in the following Section 4.1 and Figure 1, we revise the irst part of Section 30]3with fromregar [ d to cyclic dependencies and block tridiagonal systems, while the following Sections 4.2- 4.4 describe how RPTS has been extended to solve the much more general problem from Equation 1. 4.1 The Recursive Partitioned Tridiagonal Schur Complement Algorithm 4.1.1 Algorithmic Description. A (cyclic) tridiagonal adjacency matrix of a graph represents a long chain of connected vertices, which is shown on the left side in Figure 1. This chain is partitioned into regular partitions of sizeM. Here, a vertex can be a scalar or a block matrix of size n, which results in Mn equations for each partition. For all partitions in parallel, downwar the ds oriented elimination of the lower block band and the upwards oriented elimination of the upper block band is realised by applying sequential elementary row operations with pivoting as shown in Figure 1 after step I. The downwards and upwards oriented eliminations both take the original matrix as input, and therefore, representN2/M independent work tasks. During this elimination, in each partition, vertical ill-ins occur on the left and right side, which also contribute to the last and irst block row of the partition. These rows form an independent set ofN2/M block equations (the coarse system) and are stored as a separate equation ACM Trans. Parallel Comput. tridigpu: A GPU library for block tridiagonal and banded linear equation systems • 111:7 system, which is again tridiagonal, as shown by step II, in Figure 1. Steps I and II reprresent eduction the stepof RPTS. The coarse system is much smaller than the ine system, e.g., withM = 37 it has a size of only 5% of the ine system. As long as the coarse system is too large the reduction step is applied recursively until it is small enough to be solved directly (step III). Then, the solution of the coarse system is substituted in the ine system (step IV) and the downwards oriented elimination is executed again but due to the substitution no ill-in occurs this time (step V). After the latter step, the solution of jthe = M − 2 block column is already available in each partition and is then substituted in the above block equation to calculate the solution j =of M − the3 column. This upwards oriented substitution continues in the same fashion until all solutions ofMthe − 2inner block rows are computed (step VI). Steps IV, V and VI represent the substitution stepof RPTS, which only exposes N /M independent work tasks because the upwards oriented substitution is executed after the downwards oriented elimination. The redundant execution of the downwards oriented elimination in the reduction and substitution step of RPTS incurs additional calculations but for scalar tridiagonal systems RPTS is bandwidth limited, and thus, reducing data movement at the cost of more arithmetic operations is beneicial. With solvingthe of the inner partitions we refer to the downwards and upwards oriented elimination in the reduction step; and to the downwards oriented elimination and upwards oriented substitution in the substitution step. 4.1.2 Mathematical Description. With the partitioning of the graph in the left part of Figure 1 into partitions of sizeM, each partition has M − 2 inner vertices and 2 interface vertices. P bLet e the index set of all inner vertices in the partition and I be the index set of all interface vertices. 
With respect to these index sets, matrix A can be reordered by a permutation Q into the (M − 2) × (M − 2)-block diagonal A_PP, the 2 × 2-block diagonal A_II, the wide A_IP, and the tall A_PI. The Schur-complement factorization then reads:
\[
QAQ^{T} = \begin{pmatrix} A_{PP} & A_{PI} \\ A_{IP} & A_{II} \end{pmatrix}
        = \begin{pmatrix} I & 0 \\ A_{IP}A_{PP}^{-1} & I \end{pmatrix}
          \begin{pmatrix} A_{PP} & A_{PI} \\ 0 & S \end{pmatrix}, \qquad (8)
\]
\[
S := A_{II} - A_{IP}A_{PP}^{-1}A_{PI}, \qquad (9)
\]
with the Schur complement S. In Equation 8, QAQ^T is a permuted form of A from Equation 1. The above factorization allows the recursive solution of
\[
QAQ^{T}x = \begin{pmatrix} A_{PP} & A_{PI} \\ A_{IP} & A_{II} \end{pmatrix}
           \begin{pmatrix} x_{P} \\ x_{I} \end{pmatrix}
         = \begin{pmatrix} d_{P} \\ d_{I} \end{pmatrix} = d \qquad (10)
\]
in three steps
Reduce: Calculation of the coarse system S and the coarse right-hand side d'_I := d_I − A_IP A_PP^{-1} d_P.
Solve coarse system: The coarse system S x_I = d'_I is solved by factorizing S recursively.
Substitute: Find d'_P := d_P − A_PI x_I (substitute x_I) and solve x_P := A_PP^{-1} d'_P for the inner vertices.

4.2 Cyclic Systems
In case of cyclic tridiagonal systems (a_0, c_{N̂−1} ≠ 0), the coarser system S of Equation 9 is also cyclic. During the RPTS reduction step (Figure 1, steps I & II), the values a_0 and c_{N̂−1}, or due to pivoting their scaled values, are written to the upper right and lower left corner of the coarse equation system. In the higher stages, either the direct solver (step III) must be able to solve cyclic tridiagonal systems, or the system is reduced to a size of N = 2. In the latter case, the final RPTS reduction to a 2x2 system incorporates the values a_0 and c_{N̂−1} during the last row eliminations because the fill-in and a_0, c_{N̂−1} are located in the same column.
\[
\begin{pmatrix}
b_0 & c_0 &     &     &     & a_0 \\
a_1 & b_1 & c_1 &     &     &     \\
l_2 &     & b_2 & c_2 &     &     \\
l_3 &     &     & b_3 & c_3 &     \\
l_4 &     &     &     & b_4 & c_4 \\
c_5 &     &     &     & a_5 & b_5
\end{pmatrix}_{\text{downwards}}
\!\!,\qquad
\begin{pmatrix}
b_0 & c_0 &     &     &     & a_0 \\
a_1 & b_1 &     &     &     & r_1 \\
    & a_2 & b_2 &     &     & r_2 \\
    &     & a_3 & b_3 &     & r_3 \\
    &     &     & a_4 & b_4 & c_4 \\
c_5 &     &     &     & a_5 & b_5
\end{pmatrix}_{\text{upwards}}
\qquad (11)
\]
In a system where no row permutations take place and the last variables are about to be eliminated, which is a_0 in the downwards and c_4 in the upwards oriented elimination, as shown in Equation 11, the left fill-in l refers to the same unknown as c_5.

4.3 Block Tridiagonal Systems
To solve block tridiagonal systems, the RPTS reduction and substitution algorithms in [30] are adjusted to support the non-commutativity of matrix-matrix and matrix-vector operations, and the divisions are replaced by multiplications with an inverse matrix.

Algorithm 1: Scaled partial pivoting of the RPTS substitution step for block types of size n, which also applies to the scalar case (n = 1).
1  Function get_max_abs(X_0, X_1, ..., X_{l−1})
2      P ∈ R^{n×n}
3      P = 0
4      for i = 0, ..., l − 1 do
5          for j = 0, ..., n − 1 do
6              for k = 0, ..., n − 1 do
7                  P_{jj} = max(|(X_i)_{jk}|, P_{jj})
8              end
9          end
10     end
11     return P
12 end
13 b_j, c_j ∈ R^{n×n}                    // previous equation
14 a_{j+1}, b_{j+1}, c_{j+1} ∈ R^{n×n}   // current equation
15 m_j = get_max_abs(b_j, c_j)
16 m_{j+1} = get_max_abs(a_{j+1}, b_{j+1}, c_{j+1})
17 if |det(m_j · a_{j+1})| ≤ |det(m_{j+1} · b_j)| then
       /* normal case: invert b_j */
18 else
       /* pivot case: invert a_{j+1} */
19 end

We consider the previous equation b_j, c_j and the current equation a_{j+1}, b_{j+1}, c_{j+1} during the downwards oriented elimination, when either b_j or a_{j+1} is about to be eliminated.
\[
\begin{pmatrix}
\ddots & \ddots  &         &         &        \\
0      & b_j     & c_j     &         &        \\
       & a_{j+1} & b_{j+1} & c_{j+1} &        \\
       &         & \ddots  & \ddots  & \ddots
\end{pmatrix}
\qquad (12)
\]
For that purpose, the previous equation can be multiplied with (−1) a_{j+1} b_j^{-1} and be added to the current equation, or the current equation can be multiplied with (−1) b_j a_{j+1}^{-1} and be added to the previous equation. So implicitly we require that at least b_j or a_{j+1} is invertible.

Whereas the scalar pivoting decision is made based on the absolute value of a scalar, the pivoting for block types is made based on the absolute value of the determinant of the block, which is shown in Algorithm 1. To make the pivoting decisions invariant to the scaling of the linear system by a diagonal matrix, the scaled partial pivoting decision is made with respect to the maximum absolute value within each row of the equation system (diagonal matrices m_i and m_{i+1}). Setting m_i and m_{i+1} to the identity matrix degrades the scaled partial pivoting to just partial pivoting, and setting them to zero disables pivoting.

4.4 Separate Factorization and Solve
An explicit factorization of the tridiagonal matrix is beneficial if the computation cannot be hidden behind data movement any more, which is the case for block tridiagonal systems and multiple right-hand sides. After the downwards oriented elimination (step I, Figure 1), the calculated LU-factorization of the inner partitions (L_PP^(d) + I) U_PP^(d) := A_PP is saved in memory, where L_PP^(d) is strictly lower triangular and U_PP^(d) upper triangular. As a consequence of saving the modified coefficients, the upwards and downwards oriented elimination are not executed in parallel, but in two subsequent steps: upwards, and then downwards. The latter only affects the order of computations and the coarse system S of the factorization is the same as that of Equation 9. During the upwards oriented elimination, which computes the UL-factorization (U_PP^(u) + I) L_PP^(u) := A_PP, with strictly upper triangular matrix U_PP^(u) and lower triangular matrix L_PP^(u), the coefficients of L_PP^(u) are discarded, but U_PP^(u) is kept, which is required to calculate the coarse right-hand side d'_I for a right-hand side d. Similar to the solve of scalar tridiagonal systems in Section 4.1, the solution of the inner vertices x_P with a given factorization is obtained by x_P := ((L_PP^(d) + I) U_PP^(d))^{-1} d'_P.

5 IMPLEMENTATION
The core implementation of the tridigpu<t><format>sv functions consists of two CUDA kernels: the reduction kernel and the substitution kernel, which compute the reduction and substitution step of RPTS (see Section 4.1) for arbitrary types of the matrix coefficients and vector elements. In fact, the templatized source code for solving the inner partitions is the same for all problem classes, whereas the source code which loads the data to on-chip memory and writes the result to global memory differs between the problem classes. The beginning of this section summarizes general implementation features, similar to [30], but extended to the general problem classes of this paper. More elaborate topics are discussed in the following subsections 5.1-5.6.

Active warps: The number of processed partitions per CUDA block is usually L = 32 because we want a large M (shallow recursion) and the product LM is limited by the amount of available shared memory per block. For L = 32, there are two CUDA warps in the reduction kernel and one CUDA warp in the substitution
kernel solving the inner partitions actively. The other threads in the CUDA block only participate during data load and store operations.

Fig. 2. Pivoting decisions represented as a binary tree and encoded as the bit pattern 0b000...1101. Either the current equation j or the next equation j + 1 is used to eliminate the coefficient in column j.

Memory layout: The block tridiagonal matrix is saved in a banded format, which are three buffers (a_i, b_i, c_i) of length N̂. a_0 and c_{N̂−1} couple the first and last equation and are non-zero for cyclic problems. The additional memory consumption of RPTS, which is required by the higher stages in the hierarchy, is very little, and for n = 1, N̂ = 2^25 or N̂ = 2^20 and M = 41 amounts to less than 6% of the input data.
Shared memory layout: The block tridiagonal partitions and the corresponding part of the right-hand side are loaded coalesced into shared memory, where they are subsequently processed sequentially by an active thread, which solves the inner of a partition. Note that the shared memory consumption increases quadratically with n, e.g., going from block size n = 2 to n = 4 quadruples the consumption.
Storage format of row interchanges during pivoting: The pivoting decisions can be saved in a binary tree because for each column j we only encode whether the coefficient in row j or row j + 1 is eliminated (see Figure 2). By saving this bit pattern in an int64_t, the row permutation can be reconstructed out of that bit pattern.
Complete avoidance of SIMD divergence: Dynamic pivoting decisions tend to be implemented with if-else statements and branching. Nevertheless, the code for solving the inner partitions in the reduction and substitution kernel of RPTS is completely free of branching by replacing if-else statements with ternary operators (x = p ? a : b). The elimination of the coefficient in column j is then formulated as a linear combination of two equations where the scaling factors of the equations depend on the pivoting decision.

Table 2. Data reads and writes from global GPU memory for one stage and different kernels of tridigpu<t>gtsv in case of a scalar tridiagonal system. See Table 1 for the meaning of the variables. The gray bracketed expressions indicate to which variable the reads or writes correspond.

kernel       | read elements                                          | written elements                        | read & written
reduction    | 3N + n_r N  (A + D)                                    | (3 + n_r) · 2N/M  (coarse A + coarse D) | 3N + n_r N + (3 + n_r) · 2N/M
substitution | (n_r/n̂_r) 3N + n_r N + n_r · 2N/M  (A + D + coarse X) | n_r N  (fine X)                         | (n_r/n̂_r) 3N + 2 n_r N + n_r · 2N/M
total        | 3N (n_r/n̂_r + 1) + 2 n_r N + n_r · 2N/M               | (3 + n_r) · 2N/M + n_r N                | 3N (n_r/n̂_r + 1 + n_r) + (2 n_r + 3) · 2N/M  (fine system + coarse system)

5.1 Multiple Right-Hand Sides
The reduction kernel of tridigpu coarsens the tridiagonal system until it is small enough to be solved directly. Subsequently, the substitution kernel redundantly solves the inner partitions, but without any fill-in, to obtain the fine solution vector. For the solution of one scalar tridiagonal system the solver reads and writes in total 9N elements on the first stage. This can be seen in Table 2, where n̂_r is the number of right-hand sides which are kept simultaneously in on-chip memory, such that the amount of inner partition solves is reduced. Now setting n_r = 1, n̂_r = 1, and neglecting the terms of the coarse system with factor 2N/M, 9N elements are read and written in total.
With a coarse system size of 2N/M, the amount of data which is read and written in higher stages is of order 9 · 2N/M = 18N/M, which is one order of magnitude smaller than the fine system (9N) because typical values of M for a scalar tridiagonal system are in the range [31, 41]. For multiple right-hand sides, the tridiagonal system is only loaded once into on-chip memory in the reduction kernel because the tridiagonal system is not changed in memory during the inner partition solve. The same optimization cannot be applied to the substitution kernel because the tridiagonal system is modified during the inner partition solve, but there the tridiagonal system is only loaded once for n̂_r right-hand sides. In comparison to a naive repeated solver execution, which needs to process E = n_r 9N elements, solving n_r right-hand sides with one single solver execution only processes E_r = 3N (n_r/n̂_r + 1 + n_r) elements, which in relation to E is only
\[
\frac{E_r}{E} = \frac{3N\,(n_r/\hat{n}_r + 1 + n_r)}{n_r\,9N}
             = \frac{n_r\,3N\,(1/\hat{n}_r + 1/n_r + 1)}{n_r\,9N}
             \overset{(\hat{n}_r = 4)}{=} \frac{1/4 + 1/n_r + 1}{3}
             \overset{(n_r \gg 1)}{\approx} \frac{5/4}{3} \approx 42\%.
\]

5.2 Optimizations for Small Tridiagonal Systems
For a static partition size M, we observe that solving small tridiagonal systems does not utilize all available GPU resources, which is expected because M determines the work per CUDA thread and is chosen as a trade-off between available shared memory and number of stages. For smaller M, less shared memory is used per CUDA block, therefore more CUDA blocks can be resident per streaming multiprocessor, but the size of the coarser tridiagonal system N' increases because of N' = 2N/M. For larger M, the shared memory consumption is high, and fewer CUDA blocks are resident per streaming multiprocessor, which is inefficient in the limit of only one active CUDA block because for each CUDA block only one warp is solving the inner partitions in the substitution kernel. To achieve maximum GPU performance it is essential to make use of latency hiding techniques by changing the context to another warp if the current warp is idling. The latter is impossible if only one CUDA block is resident per streaming multiprocessor, thus M must not be chosen too large. Therefore, our aim is to decrease the number of stages and to increase the GPU utilization. To address the former point, we developed a fused final stage kernel (FFSK), which is able to solve tridiagonal systems of size warpSize · M on chip, and to address the latter point we use a dynamic choice of M, which depends on the GPU occupancy.

5.2.1 Fused Final Stage Kernel. The FFSK is a kernel fusion of the reduction kernel, single thread final stage kernel, and substitution kernel of RPTS. The single thread final stage kernel of RPTS [30] only uses one single CUDA thread to solve tridiagonal systems of maximum size Ñ = 32. Ñ is the maximum tridiagonal system size of the final stage tridiagonal solver. The FFSK loads the tridiagonal system once into shared memory, calculates the RPTS reduction step, saves the coarse tridiagonal system in shared memory, solves the coarse system, calculates the RPTS substitution step to obtain the fine solution, and stores the fine solution back to global GPU memory. It is possible to use other well known parallel algorithms, like the (Parallel) Cyclic Reduction, as a final stage solver, but in our experiments we observed that these algorithms are less numerically stable. The FFSK is implemented with (scaled) partial pivoting for improved numeric stability.
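To make the role of the final stage concrete, the following minimal host-side sketch shows a sequential tridiagonal solve with partial pivoting, i.e. the kind of work a single final-stage thread performs. It is only an illustration under simplifying assumptions (non-cyclic system, c[N-1] = 0, no scaled pivoting, no blocks); it is not library code, and the actual kernels are branch-free and operate in shared memory.

    #include <math.h>

    /* Sketch: sequential tridiagonal solve with partial pivoting (not library code).
     * a: subdiagonal, b: diagonal, c: superdiagonal, d: right-hand side (overwritten),
     * x: solution, e: second superdiagonal fill-in caused by row swaps; all length N. */
    static void tridiag_solve_pivot(int N, double *a, double *b, double *c,
                                    double *d, double *x, double *e)
    {
        for (int i = 0; i < N - 1; ++i) {
            e[i] = 0.0;
            if (fabs(a[i + 1]) > fabs(b[i])) {       /* pivot: swap rows i and i+1 */
                double t;
                t = b[i]; b[i] = a[i + 1]; a[i + 1] = t;
                t = c[i]; c[i] = b[i + 1]; b[i + 1] = t;
                e[i] = c[i + 1]; c[i + 1] = 0.0;     /* fill-in two columns to the right */
                t = d[i]; d[i] = d[i + 1]; d[i + 1] = t;
            }
            double m = a[i + 1] / b[i];              /* eliminate the subdiagonal entry */
            b[i + 1] -= m * c[i];
            c[i + 1] -= m * e[i];
            d[i + 1] -= m * d[i];
        }
        x[N - 1] = d[N - 1] / b[N - 1];              /* backward substitution */
        if (N > 1)
            x[N - 2] = (d[N - 2] - c[N - 2] * x[N - 1]) / b[N - 2];
        for (int i = N - 3; i >= 0; --i)
            x[i] = (d[i] - c[i] * x[i + 1] - e[i] * x[i + 2]) / b[i];
    }

In the kernels, the same elimination is instead expressed as a pivot-dependent linear combination of two rows via ternary operators (see Section 5), so no branch like the one above remains.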
Algorithm 2: Dynamic choice of the partition size M for each stage i of RPTS.
input: N, sorted {M_j}, n̂_r
output: {N_i}, {M_i}
1  j_max = arg max_j({M_j})
2  Ñ = warpSize · M_{j_max}
3  N_0 = N
4  i = 0
5  M_0 = M_{j_max}
6  while N_i > Ñ do
7      j = j_max − 1
8      while grid_dim(N_i, M_i) < max_concurrent_blocks(N_i, M_i, n_r) ∧ M_i ≠ M_0 do
9          M_i = M_j
10         j = j − 1
11     end
12     i = i + 1
13     N_i = 2 · ceil(N_{i−1}/M_{i−1}) − (N_{i−1} % M_{i−1} == 1 ? 1 : 0)
14     M_i = M_{j_max}   // set M for next loop iteration
15 end
16 M_i = get_smallest_final_stage_M(N_i, {M_j})

5.2.2 Dynamic choice of M. Algorithm 2 shows the dynamic choice of the partition size for each stage i of RPTS, where we denote an array of elements with curly braces { · }. The {M_j} are the sorted partition sizes, whereas M_i is the partition size and N_i the tridiagonal system size for stage i. In Line 1 the index j_max of the largest admissible partition size is obtained. In Line 6, the while-loop processes every stage configuration until the system is small enough to be solved directly. In Line 8, a maximum value for M is chosen, such that the CUDA grid is at least as large as the maximum possible number of concurrently running blocks. Subsequently, in Line 13, the size of the coarser tridiagonal system is calculated, and the algorithm ends by calculating the smallest possible partition size for the final stage solver in Line 16.

5.3 Block tridiagonal systems
The kernels of tridigpu are implemented for general numerical types, which support +, -, *, / and abs. Commutativity for operator '*' is not required as this is not fulfilled for block types either. Operator '/' for block types is equal to the multiplication with the inverted block, and abs calculates the absolute value of the determinant of the block, which is required to make the pivoting decisions. In the kernel implementation, the pivoting scheme (see Section A.4) is passed in as a template parameter, which avoids any unnecessary operations at runtime. With generic programming and templates, no separate code paths are required to support block tridiagonal systems. Instead of instantiating the kernels with scalar value types, they are instantiated with a block matrix of fixed size when solving block tridiagonal systems.

5.4 Factorization of a Block Tridiagonal Matrix
The computational effort of those kernels which always calculate the factorization anew (functions tridigpu<t><format>sv) grows as O(n^3). Therefore, a large block size n prevents the kernels from running at full GPU memory bandwidth. In that case, it is efficient to precalculate the tridiagonal matrix factorization, such that no block inversions,
determinant calculations and matrix-matrix operations are calculated redundantly. The overall data movement is shown in Table 3.

Table 3. Data reads and writes from global GPU memory for one stage and different kernels of tridigpu<t>bgtf and tridigpu<t>bgtfs. s_t is sizeof(T) and s_i is sizeof(int64_t). See Table 1 for the meaning of the variables. The gray bracketed expressions indicate to which variable the reads or writes correspond.

kernel              | bytes read                                                                                      | bytes written
factorization, bgtf | 3N̂n · s_t  (fine A)                                                                            | (3N̂n + 2N̂n + 3n · 2N̂/M) · s_t + 2 · N̂/M · s_i  (factorized A + r_u and r_d + coarse A + pivots)
reduction, bgtfs    | (2N̂n + n_r N̂) · s_t + 2N̂/M · s_i  (factorized A + fine D + pivots)                           | n_r · 2N̂/M · s_t  (coarse D)
substitution, bgtfs | (N̂n + 3N̂n + n_r N̂ + n_r · 2N̂/M) · s_t + N̂/M · s_i  (r + factorized A + fine D + coarse X + pivots) | n_r N̂ · s_t  (fine X)

The factorization kernel calculates the coarse system (2N/M blocks), the pivot decisions (N̂/M elements of type int64_t), the inverted blocks r_u, r_d (2N blocks), and the modified tridiagonal system after the downwards oriented elimination (3N blocks). With r_u and r_d we denote the row multiplication factors of the upwards and downwards oriented elimination, respectively. In the lower band of the factorized matrix, we save the inverted blocks a^{-1} as these values must be inverted in the upwards oriented substitution step of RPTS (Algorithm 2 in [30], Line 31) to obtain the solution values. Therefore, the substitution and reduction kernel, which solve a system with a given factorization, are completely free of block inversions.

Given a precomputed factorization, the reduction kernel only reads 2N inverted blocks (r_u, r_d), the pivot decisions, which are efficiently saved in bits, and the right-hand sides with N n_r blocks; and only the coarse right-hand sides are written to global memory. The upwards and downwards oriented elimination is calculated in parallel and only consists of matrix-vector multiplications and vector additions. If we neglect the reading of the pivot decisions, the amount of read data is reduced for the reduction kernel of bgtfs by N elements in comparison to bgtsv. The substitution kernel of bgtfs reads one additional band of N elements in comparison to bgtsv, which are the inverted blocks r of the downwards oriented elimination, and the pivot decisions. The downwards oriented elimination must be calculated for the right-hand sides with the already available inverted blocks and pivot decisions. This cannot be pre-calculated in the reduction kernel because the solution from the coarse system is not yet available.

Fig. 3. Banded format to block tridiagonal format conversion. Each color refers to one band. Here, N̂ = 14, N = 7, k_l = 2, k_u = 2, and consequently n = 2.

5.5 Banded Format to Block Tridiagonal Conversion
To solve banded systems with tridigpu<t>gbsv, the system is converted into block tridiagonal form while reading the bands from global to on-chip memory, as shown in Figure 3. Afterwards, the RPTS algorithm can be used to solve the on-chip block tridiagonal system. To save a banded matrix with k_l + k_u + 1 bands as a block tridiagonal matrix, the blocks must be of size max(k_l, k_u), and for each matrix row max(k_l, k_u) − 1 + abs(k_l − k_u) zeros are additionally saved. Therefore, the memory consumption of a block tridiagonal matrix representation is increased by the factor (max(k_l, k_u) − 1 + abs(k_l − k_u))/(k_l + k_u + 1) + 1 in comparison to the banded matrix representation. For full pentadiagonal systems (k_l = k_u = 2) this is an increase of 20%. Note that the block tridiagonal representation of the initial banded system is never saved in global GPU memory, but exclusively in on-chip memory. Only the intermediate results in the higher stages of the solver write and read block tridiagonal systems. However, this contributes little to additional data movement, because the coarse systems are small. Instead, the speed of the block tridiagonal solver is bounded by the limited thread parallelism in the computation.
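The index arithmetic behind this conversion is simple blocking. The following hedged sketch illustrates it as a host-side copy, assuming N̂ is divisible by n and using a hypothetical accessor band_get(i, j) for the banded input; the block buffers follow the layout of Section A.2.3. The library itself performs this conversion on the fly into on-chip memory rather than materializing it like this.

    #include <string.h>

    extern double band_get(int i, int j);   /* hypothetical accessor for the banded input */

    /* Sketch of the banded-to-block-tridiagonal index mapping (not library code).
     * The scalar entry A(i,j) of a banded matrix with k_l sub- and k_u superdiagonals
     * lands in block row I = i/n and block column J = j/n with n = max(k_l, k_u);
     * |I - J| <= 1 always holds.  a/b/c store N blocks of n*n values each in
     * column-major order (Section A.2.3). */
    void banded_to_block_tridiag(int N_hat, int kl, int ku,
                                 double *a, double *b, double *c)
    {
        int n = (kl > ku) ? kl : ku;                 /* block size */
        int N = N_hat / n;                           /* number of block rows */
        memset(a, 0, (size_t)N * n * n * sizeof(double));
        memset(b, 0, (size_t)N * n * n * sizeof(double));
        memset(c, 0, (size_t)N * n * n * sizeof(double));

        for (int i = 0; i < N_hat; ++i) {
            for (int j = i - kl; j <= i + ku; ++j) { /* only the stored bands */
                if (j < 0 || j >= N_hat) continue;
                int I = i / n, J = j / n;            /* block row and column */
                int r = i % n, q = j % n;            /* position inside the block */
                double *blk = (J < I) ? &a[I * n * n]
                            : (J > I) ? &c[I * n * n]
                            :           &b[I * n * n];
                blk[q * n + r] = band_get(i, j);     /* column-major inside the block */
            }
        }
    }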
5.6 Calculation of A^{-1} D with Sparse Right-Hand Sides and Solutions
Function gtsv_csc2csc utilizes gtsv_csc2dense and a k-selection algorithm [16], which selects the k largest elements with respect to their absolute value, but without sorting them. The right-hand sides are processed in batches of n'_r vectors. For each n'_r right-hand sides, gtsv_csc2dense is called once, and the k-selection algorithm is called n'_r times, which results in 4 n'_r histogram calculations in the worst case for single precision floating-point numbers. Our k-selection algorithm is similar to the BucketSelect of Alabi et al. [1], but has a fixed histogram size of 256 bins. We exploit that positive floating point numbers (IEEE 754-1989) retain their ordering when they are reinterpreted as unsigned integers based on their bit pattern. For single precision floating point numbers, a maximum of four histograms is calculated to find the kth largest number. In the first histogram, each of the 256 bins represents the occurrence of a specific byte prefix. Afterwards, one single CUDA block calculates the prefix sum and determines if an accumulated bin contains exactly k numbers. In that case, all numbers with the byte prefix represented by that bin are selected with CUB's select_if device primitive [33]. If there is no accumulated bin which contains k numbers, the histogram and the prefix sum are calculated for the next byte of the four byte pattern until a bin contains the remaining number of required maximum elements, to obtain a byte prefix which is used to select the k maximum values with CUB's select_if with respect to their absolute value. For each right-hand side which is processed by gtsv_csc2csc, k is set to the number of non-zero values in the corresponding sparse result matrix column.

6 RESULTS
For the results presented in this paper, we use a machine with a GeForce RTX 2080 Ti, CUDA 11.2.142, CentOS 7, GCC 10.2.0, CUDA driver 450.57, and an Intel Xeon Silver 4110 @ 2.10 GHz. The scalar tridiagonal solver performance of RPTS was already evaluated in detail in [30]. For the performance results single-precision is chosen because the GeForce RTX 2080 Ti has very few double-precision arithmetic units. Throughout this section, we use partial pivoting if the pivoting scheme is not mentioned explicitly.

6.1 Numerical Results
In this section we analyze the numerical stability of RPTS as a block tridiagonal and banded solver (n = 2), and compare the results with well established and numerically stable CPU solvers: LAPACK (gbsv) and the sparse LU decomposition implemented in Eigen3 [21], as well as with the QR factorization implemented in cuSPARSE for the solution of batched pentadiagonal systems (DgpsvInterleavedBatch). For RPTS we set the partition size to M = 16 block rows, such that there is only one reduction and one substitution step for the presented test cases.
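As a hedged, worked example of this choice (assuming the pentadiagonal test matrices of size 512 from Table 4, interpreted as 2x2 blocks):
\[
N = \frac{\hat{N}}{n} = \frac{512}{2} = 256 \text{ block rows}, \qquad
N' = \frac{2N}{M} = \frac{2 \cdot 256}{16} = 32 \text{ coarse block rows},
\]
so a single reduction step followed by one substitution step suffices, consistent with the statement above.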
tridigpu: A GPU library for block tridiagonal and banded linear equation systems • 111:15 ID condition number description 1 3.45e+02 pentadiag(U (−1, 1), U (−1, 1), U (−1, 1), U (−1, 1), U (−1, 1)) 2 1.00e+00 pentadiag(U (−1, 1), U (−1, 1), 1e8*ones(N), U (−1, 1), U (−1, 1)) 3 3.45e+02 same as #1, but dll[N/2+1] = 1e-50*dll[N/2+1] 4 3.43e+02 same as #3, but dl[N/2+1] = 1e-50*dl[N/2+1], dll[N/2+2] = 1e-50*dll[N/2+2] 5 3.43e+03 same as #1, but each element ofdll, duu has 50% chance to be zero 6 7.78e+03 same as #1, but each element ofdll, dl, du, and duu has 50% chance to be zero 7 1.06e+00 pentadiag(U (−1, 1), U (−1, 1), 64*ones(N), U (−1, 1), U (−1, 1)) 8 9.98e+14 gallery(’randsvd’, N, 1e15, 2, 2, 2) 9 9.43e+14 gallery(’randsvd’, N, 1e15, 3, 2, 2) 10 1.18e+15 gallery(’randsvd’, N, 1e15, 1, 2, 2) 11 9.65e+14 gallery(’randsvd’, N, 1e15, 4, 2, 2) 12 3.55e+07 same as #1, but dll = dll*1e-50 13 6.76e+34 same as #12, but dl = dl*1e-50 14 1.02e+01 gallery(’toeppen’, N, 1, -10, 0, 10, 1) 15 1.45e+03 pentadiag(U (−1, 1), U (−1, 1), 1e-8*ones(N), U (−1, 1), U (−1, 1)) 16 1.45e+03 pentadiag(U (−1, 1), U (−1, 1), zeros(N), U (−1, 1), U (−1, 1)) 17 3.57e+04 pentadiag(ones(N-2), ones(N-1), 1e-8*ones(N), ones(N-1), ones(N-2)) 18 1.00e+00 pentadiag(ones(N-2), ones(N-1), 1e8*ones(N), ones(N-1), ones(N-2)) 19 2.56e+00 pentadiag(-ones(N-2), -ones(N-1), 8*ones(N), -ones(N-1), -ones(N-2)) 20 2.56e+00 pentadiag(-ones(N-2), ones(N-1), 8*ones(N), ones(N-1), -ones(N-2)) 21 1.51e+00 pentadiag(-ones(N-2), ones(N-1), 8*ones(N), ones(N-1), ones(N-2)) 22 1.54e+00 pentadiag(-ones(N-2), -ones(N-1), 8*ones(N), ones(N-1), U (−1, 1)) 23 1.85e+00 pentadiag(-ones(N-2), -ones(N-1), 8*ones(N), U (−1, 1), U (−1, 1)) Table 4. Pentadiagonal matrix collection for numerical-stability analysis. MATLAB functions are used. Function pentadiag(dll, dl, d, du, duu) returns a pentadiagonal matrix with main diagonal d, lower diagonals dll, dl, and upper diagonals du, duu. U (−1, 1) is the uniform distribution between one and minus one. All matrices are of the same size N . The condition number was calculated with the JacobiSVD function of the Eigen3 library for matrices ofN size= 512. matrix problem DOFs nnz bands LAPACK Eigen3 RPTS RPTS partial pivot no pivot nos2 structural 957 4 137 9 1.90e+00 9.85e-01 2.10e+00 1.03e+01 olm1000 luid dynamics 1 000 3 996 7 8.90e-15 8.12e-15 1.01e-13 9.92e-14 olm2000 luid dynamics 2 000 7 996 7 1.73e-09 1.73e-09 6.92e-09 2.55e-09 olm500 luid dynamics 500 1 996 7 6.27e-13 1.20e-12 4.86e-14 3.29e-14 olm5000 luid dynamics 5 000 19 996 7 2.43e-12 2.82e-12 7.13e-13 7.73e-13 tub1000 luid dynamics 1 000 3 996 5 1.45e-13 1.35e-13 3.89e-14 1.13e-14 linverse statistical 11 999 95 977 91.32e-11 1.46e-11 2.27e-12 2.29e-12 spmsrtls statistical 29 995 229 947 91.50e-14 2.67e-14 9.62e-11 9.62e-11 lf10000 model reduction 19 998 99 982 7 5.29e-12 3.33e-12 3.18e-13 8.52e-13 Table 5. Let: properties of matrices from the Sparse Matric Collection 8]. Right: [ forward relative error of double-precision results for numerical stability analysis of banded solvers. For the evaluation, the solution vector x is generated with a normal distribution of loating-point numbers with a mean value of 3 and standard deviation of 1. The forward relative error is then calculate |x − xd |by /|x | , t 2 t 2 where x is the calculated solutionxand is the exact solution. 
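The error metric itself is straightforward; a minimal sketch (assuming double-precision host arrays) of the forward relative error |x − x_t|_2 / |x_t|_2 used in the tables, where x is the computed and x_t the exact solution:

    #include <math.h>

    /* Forward relative error |x - x_t|_2 / |x_t|_2 (illustrative helper, not library code). */
    double forward_relative_error(int N, const double *x, const double *x_t)
    {
        double err = 0.0, ref = 0.0;
        for (int i = 0; i < N; ++i) {
            double d = x[i] - x_t[i];
            err += d * d;
            ref += x_t[i] * x_t[i];
        }
        return sqrt(err) / sqrt(ref);
    }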
We create the 2x2 block tridiagonal test matrices by transforming the scalar tridiagonal test matrices 30]: from [ N /2×N /2 N ×N IfA ∈ R is a test matrix from 30][ Table 1, we obtain a 2x2 block tridiagonal test matrix A ∈ R s b ACM Trans. Parallel Comput. 111:16 • Christoph Klein and Robert Strzodka ID LAPACK Eigen3 cuSPARSE RPTS RPTS κ κ partial nopivot partial pivot no pivot(coarse system) 1 2.92e-15 2.11e-15 1.00e-14 6.49e-15 2.98e-14 3.15e+02 2.46e+02 2 1.27e-16 1.27e-16 2.57e-16 1.31e-16 1.31e-16 1.00e+00 1.00e+00 3 2.81e-15 2.05e-15 6.97e-15 5.97e-15 2.86e-14 3.15e+02 2.46e+02 4 2.92e-15 2.07e-15 8.03e-15 5.92e-15 2.91e-14 3.15e+02 2.46e+02 5 5.13e-15 1.28e-14 1.60e-14 5.39e-14 1.85e-13 6.02e+02 6.02e+02 6 9.19e-15 9.19e-15 4.97e-14 8.85e-14 5.74e-13 7.70e+02 6.45e+03 7 1.43e-16 1.42e-16 3.85e-16 1.55e-16 1.55e-16 1.05e+00 1.05e+00 8 6.97e-04 6.97e-04 4.01e-03 2.99e-04 1.32e-01 2.18e+12 2.76e+15 9 8.43e-05 1.65e-04 3.19e-03 3.77e-04 1.69e-04 4.57e+14 1.85e+14 10 4.58e-04 4.06e-04 3.09e-03 2.60e-04 2.57e-04 4.24e+01 1.84e+01 11 7.31e-04 1.53e-03 1.18e-03 3.16e-04 2.12e-01 2.06e+14 2.08e+14 12 8.90e-12 8.90e-12 5.57e-11 5.75e-10 1.21e-10 1.84e+07 1.84e+07 13 2.44e+71 2.90e+71 6.86e+70 6.18e+72 8.24e+70 3.51e+28 5.12e+40 14 3.75e-16 3.65e-16 7.98e-16 3.62e-16 3.62e-16 2.94e+00 2.94e+00 15 9.23e-15 3.31e-15 9.59e-15 1.56e-14 2.39e-13 6.14e+02 2.89e+03 16 4.40e-15 4.54e-15 2.45e-14 7.39e-15 1.62e-13 6.14e+02 2.89e+03 17 5.97e-14 9.20e-14 3.35e-13 7.93e-04 4.96e+00 6.56e+08 4.49e+10 18 1.64e-16 1.63e-16 5.02e-16 1.40e-16 1.40e-16 1.00e+00 1.00e+00 19 2.00e-16 1.99e-16 7.33e-16 1.51e-16 1.51e-16 1.86e+00 1.86e+00 20 1.64e-16 1.65e-16 4.71e-16 1.63e-16 1.63e-16 1.86e+00 1.86e+00 21 1.63e-16 1.65e-16 4.47e-16 1.59e-16 1.59e-16 1.32e+00 1.32e+00 22 1.55e-16 1.55e-16 4.09e-16 1.51e-16 1.51e-16 1.31e+00 1.31e+00 23 1.49e-16 1.49e-16 3.53e-16 1.40e-16 1.40e-16 1.51e+00 1.51e+00 Table 6. Let part: Forward relative error of double-precision results for numerical stability analysis of pentadiagonal solvers. Right part: coarse system condition for RPTS with partial pivoting and without pivoting. The matrix ID refers to the matrices in Table 4. with the same conditionAasby  (A ) ifi = j  s ⌊i /2⌋, ⌊j /2⌋ (A ) ifi − 2 = j s ⌊i /2⌋, ⌊j /2⌋−1 (A ) := (13) ij (A ) ifi + 2 = j s ⌊i /2⌋, ⌊j /2⌋+1 0 otherwise A := O A O , (14) b l r where O and O are orthogonal 2x2 block diagonal matrices. With a uniformly distributed loating-point number l r α ∈ [0, π], a single 2x2 orthogonal blockOinor O is generated by l r cos (α ) − sin (α ) . (15) sin (α ) cos (α ) We create the pentadiagonal test matrices analogously to the tridiagonal numerical test5,cases 11, 30in , 32[, 38], which are listed in Table 4. The double-precision results for 2x2 block tridiagonal matrices are shown in Table 7 and for pentadiagonal matrices in Table 6. The right part of each Table shows the conditioning of the coarse tridiagonal system, which ′ 2N has a size of N = . The conditionNA isif the coarse system could not be calculated due to numerical issues in Mn ACM Trans. Parallel Comput. 
tridigpu: A GPU library for block tridiagonal and banded linear equation systems • 111:17 ID LAPACK Eigen3 RPTS RPTS κ κ partial nopivot partial pivot no pivot(coarse system) 1 7.41e-15 5.80e-15 1.35e-14 1.26e-14 7.20e+02 4.68e+03 2 1.57e-16 1.57e-16 1.64e-16 1.64e-16 1.00e+00 1.00e+00 3 1.96e-16 1.90e-16 1.95e-16 1.95e-16 2.63e+02 2.63e+02 4 7.81e-15 6.80e-15 1.37e-14 1.35e-14 7.20e+02 8.75e+03 5 1.84e-15 2.54e-15 2.54e-15 9.33e-15 1.46e+02 2.33e+02 6 1.63e-16 1.54e-16 1.71e-16 1.71e-16 1.03e+00 1.03e+00 7 2.66e-16 2.74e-16 2.65e-16 2.65e-16 3.00e+00 3.00e+00 8 1.69e-04 1.99e-04 1.47e-04 3.04e-04 7.41e+12 2.51e+11 9 1.87e-05 5.08e-05 1.84e-02 6.10e-02 1.77e+15 5.36e+19 10 9.37e-05 1.22e-04 9.97e-05 6.24e-05 1.00e+00 6.14e+01 11 2.66e-03 3.14e-03 2.37e-03 1.77e-03 1.66e+14 1.32e+14 12 1.51e+06 9.51e+05 7.08e+05 5.11e+05 9.69e+22 2.48e+23 13 4.67e+00 1.88e+00 3.11e+00 2.83e+00 6.16e+17 5.74e+17 14 5.89e-03 6.91e-03 3.36e-03 7.11e+00 4.96e+14 3.23e+18 15 1.56e+02 3.34e+02 3.71e+02 NA 1.50e+20 NA 16 1.77e-15 2.79e-15 2.57e-15 4.53e-09 4.13e+01 4.13e+01 17 1.68e-16 1.58e-16 1.74e-16 1.74e-16 1.00e+00 1.00e+00 18 1.78e-16 1.77e-16 1.89e-16 1.89e-16 1.73e+00 1.73e+00 19 1.85e-16 1.88e-16 1.81e-16 1.81e-16 1.03e+00 1.03e+00 20 1.77e-16 1.77e-16 1.68e-16 1.68e-16 1.64e+00 1.64e+00 Table 7. Let part: Forward relative error of double-precision results for numerical stability analysis of 2x2 block tridiagonal solvers for matrices of size N = 1024. Right part: coarse system condition for RPTS with partial pivoting and without pivoting. Each matrix was created by transforming the scalar tridiagonal test matrices in [30] to 2x2 block tridiagonal form. the reduction step. The results of RPTS are comparable to that of Eigen3 and LAPACK, with exception of matrix 9 in Table 7; and matrix 12 and 17 in Table 6, which is caused by high coarse system condition numbers as a result of the static partitioning and blocking. In addition to the hard-to-solve synthetic problems, we analyze the forward relative error for real-world problems from the Sparse Matrix Collection in Table 5. Although the Sparse Matrix Collection contains many matrices, only few matrices have at most four sub- and super-bands on each side of the diagonal. In ive cases, RPTS achieves a better error as LAPACK or Eigen3 but for matrices olm1000 and spmsrtls, LAPACK and Eigen3 are clearly better (see Section 3.1). 6.2 Optimizations for Small Tridiagonal Systems Figure 4 displays the performance improvement of the fused inal stage kernel (FFSK) and the dynamic choice ofM in comparison to a inal stage solver with a single CUDA thread and a static choice M. As possible of parameters forM we choose M = 17, 21, 31, 41. As expected, the speedup converges to one for large problem sizes because then the reduction and substitution step of RPTS of the initial tridiagonal system dominate the runtime. Moreover, the cost of the cyclic inal stage is shown, which reduces the tridiagonal system through additional stages down to a 2x2 matrix. 6.3 Multiple Right-Hand Sides Figure 5 shows the performance of tridigpuSgtsv for a CUDA block dimension of n128,= 32 right-hand sides,M = 31, and nˆ = 4 in dependence on the system size N and in comparison to the tridiagonal solvers ACM Trans. Parallel Comput. 111:18 • Christoph Klein and Robert Strzodka 1.6 final stage solver cyclic 1.4 FFSK single CUDA thread 1.2 1.0 10 15 20 10 15 20 2 2 2 2 2 2 N N Fig. 4. Single precison scalar tridiagonal solver performance for diferent RPTS final stage solvers. 
Left: the corresponding speedup of FFSK relative to a final stage solver with a single CUDA thread.

Fig. 5. Single precision scalar tridiagonal solver performance with n_r = 32 right-hand sides and N unknowns (Section A.4.1).

of cuSPARSE [34] (gtsv2 and gtsv2_nopivot), which also support multiple right-hand sides. The equation throughput in [MRows/s] is calculated by n_r N · 10^{-6} / runtime, with the runtime in seconds. The optimal choice of n̂_r and M depends on the CUDA architecture and is mainly affected by the amount of available shared memory. For the same configuration, Figure 7 shows the throughput to global memory for the first launched reduction kernel and last launched substitution kernel of tridigpuSgtsv, in comparison to a simple copy kernel. The throughputs were calculated from the average time of 200 kernel executions and from the amount of accessed bytes according to Table 2. The reduction and substitution kernel may show higher throughputs than the copy kernel because they perform more load than store instructions, whereas the ratio between loads and stores is exactly one for the copy kernel. For large N, the calculation in the reduction and substitution kernel is completely hidden behind data movement as they run at the bandwidth limit like the copy kernel.

Fig. 6. Single precision scalar tridiagonal solver performance of tridigpuSgtsv (Section A.4.1) and tridigpuSgtsv_csc2dense (Section A.4.4) with varying number of right-hand sides and N unknowns.

Fig. 7. Throughput of the reduction and substitution kernel of tridigpuSgtsv for the finest tridiagonal system (Section A.4.1) with n_r = 32 right-hand sides and N unknowns, in comparison to a copy kernel, which copies N single precision elements. All kernels reach the bandwidth limit of the GPU.

Figure 6 shows that solving multiple right-hand sides within the same kernel is highly beneficial: tridigpuSgtsv for one right-hand side (n_r = 1) achieves approximately 12500 MRows/s, but for n_r = 32 it runs at around 31500 MRows/s for large N. The performance increase from n_r = 1, to n_r = 2, and to n_r = 4 is much larger than from n_r = 4 to n_r > 4 due to the caching of n̂_r = min(n_r, 4) right-hand sides in on-chip memory. The dotted lines show the further increase in the performance for sparse right-hand sides with ten non-zero entries and dense solution vectors (tridigpuSgtsv_csc2dense). As expected, the reduction in data movement speeds up the runtime of the kernels. If a CUDA block detects that all right-hand side values are zero within its part of the tridiagonal system, the calculation is skipped and zeros are written to the solution vector immediately.

Fig. 8.
Single precison block tridiagonal solver performance oftridigpu with varying number of right-hand sides and N unknowns. bgtsv (Section A.4.2) repeatedly solves the inner partitions, whereas bgtfs (Section A.4.2) uses a given factorization of the tridiagonal system for the solution. 6.4 Block Tridiagonal Systems and their Factorizations Analogously to the scalar tridiagonal solver performance analysis for multiple right-hand sides in Section 6.3, Figure 8 shows the performance of tridigpu block tridiagonal solvers. As expected, solving the system with the already factorized tridiagonal system achieves higher equation throughputs for increasing block sizes because the computationally expensive block inversions and determinent calculations have already been performed in the factorization. In this benchmark, solving the system with the previously calculated factorization is eicient for {n = 2,n ≥ 4}, {n = 3,n ≥ 4}, and {n = 4,n ≥ 1}. The performance of the factorization gtf)(in comparison r r r to the solving stepgtfs ( ) withn = 1 is shown in Figure 9. As expected, the runtime diference increases for increasing block sizes n. ACM Trans. Parallel Comput. equation throughput [MRows/s] time [s] tridigpu: A GPU library for block tridiagonal and banded linear equation systems • 111:21 block size 2 3 4 tridigpuSgtf tridigpuSgtfs 1e−02 1e−03 1e−04 15 20 15 20 2 2 2 2 N N Fig. 9. Single precison block tridiagonal factorization performance tridigpuSgtf ( ) in comparison to the solving step (tridigpuSgtfs, n = 1), which are declared in Section A.4.2. cuSPARSE ILU(0) LAPACK 512 8 solver tile size cusparseSgpsvInterleavedBatch tridigpuSgbsv 64 N bands: 5 bands: 7 bands: 9 6699.61 3417.5 1364.39 10 15 20 25 10 15 20 25 10 15 20 25 2 2 2 2 2 2 2 2 2 2 2 2 Fig. 10. Single precision banded solver performance. FortridigpuSgbsv (Section A.4.3), the maximum equation throughputs are shown. ACM Trans. Parallel Comput. equation throughput [MRows/s] time [s] equation throughput [MRows/s] 111:22 • Christoph Klein and Robert Strzodka 6.5 Banded Systems For optimal performance results on the RTX 2080 Ti, we choose a CUDA block dimensionnof=128, 2, M{ = 13, L = 32}, {n = 3, M = 7, L = 32}, and {n = 4, M = 7, L = 16}, where L denotes the number of partitions per CUDA block. For banded or block tridiagonal systems, M is rather small, due to the shared memory consumption of each block growing quadratically with the block n. For size n = 4, only half a warp is solving the inner partitions in the substitution kernel, whereas still two half warps are active in the reduction kernel, due to the parallel upwards and downwards oriented elimination. We compare the performance against LAP gbsv ACK’s running on the CPU, and against the interleaved batched pentadiagonal solv SgpsvInterleavedBatch er ( ) as well as the ILU(0) solver from cuSPARSE. The ILU(0) solver for fully populated banded systems, is identical to the LU factorization because no additional ill-in occurs. The performance shown for the ILU(0) does not include the time which is required for the calculation of the L und USgpsvInterleavedBatch factors. is optimized for many small pentadiagonal systems, which it into on-chip memory. The banded solv tridigpu e of , can also solve many small systems when batched together into one, but its hierarchical stages are actually designed to solve large pentadiagonal systems. 
Hence, this comparison is in fav SgpsvInterleavedBatch or of , but this is the only other GPU library with explicit support for at least certain pentadiagonal systems. Figure 10 shows the achieved equation throughput for diferent solvers, number of bands in the matrix and tile sizes. The tile size only afects the performance of the cuSPARSE solvers, and not surprisingly, the performance drops below 3 MRows/s when a single system of size N is solved. For a tile size N ,of one pentadiagonal system is solved, and for a tile size of N /64 64,pentadiagonal systems are solved with one function call. The ILU(0) factorization, which can be used to solve pentadiagonal systems exactly, performs poorly if all bands are full with non-zero entries and thus the solving step is calculated with less parallelism. For pentadiagonal systems, we obtain the amount of bytes readSgpsvInterleavedBatch by forN = 2 from NVIDIA Nsight Compute as tile size 512 64 8 bytes read [MB] 46.149 45.528 32.327 although the size of the 5 bands and right-hand side is equal to 25.166 MB. The reduction and substitution kernel oftridigpu read 51 MB on the irst stage, which is more than twice of the pentadiagonal system, due to reading the coarse solution in the substitution kernel. Nevertheless, tridigpu the banded solve achieves the same performance as the interleaved batched solve from cuSPARSE with a tile size of 64N for . Alarge s expected the equation throughput decreases if the number of bands are increasetridigpu d for due to the increased amount of computational efort, in particular the block inversions with a comple O (n xity ). of −1 6.6 Calculation ofA D with Sparse Right-Hand Sides and Solutions We solve a scalar tridiagonal system AX = D with many right-hand sides D with function gtsv_csc2csc. The results forgtsv_csc2dense with a sparse right-hand side and a dense solution are shown in SectionD6.3. , For we take the transposed matrices of the Sparse Matrix Collection, which are listed in Table 8, and set the CSC column pointers of the result X equal to those ofD. Note, that the matrices from the Sparse Matrix Collection are only used as right-hand sides, and are not inverted. With diferent sizes and number of non-zero elements they provide a test set for analyzing the function’s performance depending on diferent sparsity patterns of the right-hand sides. The coeicients of the tridiagonalAmatrix are generated with a uniform distribution between minus one and one and the number of right-hand sides, which are processed in one batch, is set to n = 128. The corresponding runtimes can be seen in Figure 11. More than half of the runtime is consumed by the selection of the largest absolute values of the dense tridiagonal solver result. single For 10precision elements, the histogram256 calculating kernel ofk-sele the ction algorithm runs with 320 GB/s on the RTX 2080 Ti, and is limited by the inter-warp aggregated irregular atomic additions in shared memory. In comparison to CUB’s ACM Trans. Parallel Comput. tridigpu: A GPU library for block tridiagonal and banded linear equation systems • 111:23 kernel gtsv_csc2dense k_selection matrix Fig. 11. gtsv_csc2csc runtime in single precision for many sparse right-hand sidesD from the Sparse Matrix Collection (Section A.4.5). 
Name Problem DOFs nnz max_element(nnz ) ecology2 2D/3D 999 999 4 995 991 5 thermal2 thermal 1 228 045 8 580 313 11 geo_1438 structural 1 437 960 63 156 690 57 atmosmodl luid dynamics 1 489 752 10 319 760 7 hook_1498 structural 1 498 023 60 917 445 93 ml_geer structural 1 504 002 110 879 972 74 af_shell10 structural 1 508 065 52 672 325 35 g3_circuit circuit simulation 1 585 478 7 660 826 6 transport structural 1 602 111 23 500 731 15 Table 8. Matrices from the Sparse Matrix Collection 8] [which are used as sparse right-hand sidesD. max_element(nnz ) is the maximum number of non-zeros per row. device histogram 33[], that is a speedup of 1.57. The overall runtime is in order of hundreds of seconds for the chosen matrices, which is caused by the runtime complexity O (n of N ) for aN ×n matrixD, as the hierarchical r r solve of one tridiagonal system is O (Nin ). A faster alternative would be to drop coeicients early D in during the solve, similar to how ILU drops coeicients early A during in a factorization. However, this introduces loss of precision and here we focus on the accuracy of the result. IfA contains many independent small tridiagonal systems other algorithms, which dynamically solve only the independent tridiagonal systems with non-zero right-hand sides, are more eicient. 7 SUMMARY AND OUTLOOK In this section we discuss the limitations and future work for the presented function classes. ACM Trans. Parallel Comput. time[s] ecology2 thermal2 Geo_1438 atmosmodl Hook_1498 ML_Geer af_shell10 G3_circuit Transport 111:24 • Christoph Klein and Robert Strzodka The scalar tridiagonal solve for multiple right-hand sides [c]gtsv ( ) is only limited by GPU memory bandwidth if the GPU has a suicient amount of computational units for the corresponding precision. These functions run at hardware limit and cannot be further optimized. The block tridiagonal solve and factorization are limited by thread execution latencies. In comparison to the scalar solves, the block tridiagonal solves have an increased computational efort, which cannot be hidden behind global memory accesses any more if only one warp is solving the inner partitions. For the future, this motivates us to increase the amount of parallelism in the matrix-matrix or matrix-vector operations during the solving of the inner partitions. However, this is not just a change in the parameterization, a signiicantly diferent parallelization strategy would be required. This new parallelization should overcome the limitation n of ≤ 4 to some extent, maybe up to n ≤ 16. But if we wanted to process even larger n eiciently, then yet another parallel strategy would have to be invented. The sparse tridiagonal solves are eicient for sparse right-hand side input vectors, which it into GPU memory cache because each CUDA block reads all right-hand side elements. Approximately 60% of the runtime ofgtsv_csc2csc is limited by the performance of k-sele the ction algorithm. Therefore, increasing the performance of the device histogram by stronger data privatization or merging k-selethe ction into the substitution kernel of the scalar tridiagonal solver will be investigated. To ensure performance portability across diferent GPU architectures, an auto-tuning mechanism to determine the optimal kernel parameters like the block dimension or partition M would size be useful in the future. 8 CONCLUSION Our tridigpu CUDA library for block cyclic, tridiagonal, and banded systems covers multiple problem classes which often appear in practice. 
We have discussed the C API, the underlying algorithms, the numerical stability, and the performance of these problem classes in detail. In comparison, we showed that our solvers outperform state-of-the-art tridiagonal solvers as they are implemented in cuSPARSE, and that our banded solver keeps up with batched pentadiagonal solvers, although it is able to solve large pentadiagonal systems and not only batched systems. REFERENCES [1] Tolu Alabi, Russel Steinbach, Jefrey D. Blanchard, and Bradley Gordon. 2012. Fast k-Selection Algorithms for Graphics Processing Units.ACM Journal of Experimental Algorithmics 17, 4 (2012), 4.1ś4.29. https://doi.org/10.1145/2133803.2345676 [2] Pablo Alfaro, Pablo Igounet, and Pablo Ezzatti. 2012. A study on the implementation of tridiagonal systems solvers using a GPU. In Proceedings - International Conference of the Chilean Computer Science Society, SCCC . IEEE, 219ś227. https://doi.org/10.1109/SCCC.2011.29 [3] E Anderson, Z Bai, C Bischof, S Blackford, J Demmel, J Dongarra, J Du Croz, A Greenbaum, S Hammarling, A McKenney, and D Sorensen. 1999. LAPACK Users’ Guide(third ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA. [4] Li-Wen Chang. 2014.Scalable parallel tridiagonal algorithms with diagonal pivoting and their optimization for many-core archite . ctures Ph.D. Dissertation. [5] Li Wen Chang, John A. Stratton, Hee Seok Kim, and Wen Mei W. Hwu. 2012. A scalable, numerically stable, high-performance tridiagonal solver using GP International Us. Conference for High Performance Computing, Networking, Storage and Analysis, (2012). SC https://doi.org/10.1109/SC.2012.12 [6] Xiaojun Chen and Weijun Zhou. 2010. Smoothing nonlinear conjugate gradient method for image restoration using nonsmooth nonconvex minimization. SIAM Journal on Imaging Sciences3, 4 (2010), 765ś790. https://doi.org/10.1137/080740167 [7] Andrew Davidson, Yao Zhang, and John D. Owens. 2011. An auto-tuned method for solving large tridiagonal systems on the GPU. Proceedings - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011 (2011), 956ś965. https://doi.org/10.1109/ IPDPS.2011.92 [8] Timothy A. Davis and Yifan Hu. 2011. The University of Florida Sparse Matrix Colle ACM Tction. rans. Math. Software 38, 1 (2011). https://doi.org/10.1145/2049662.2049663 [9] Adrian Perez Dieguez, Margarita Amor, and Ramon Doallo. 2016. New Tridiagonal Systems Solvers on GPU Archite Proce ctur edings es. - 22nd IEEE International Conference on High Performance Computing, HiPC 2015 (2016), 85ś93. https://doi.org/10.1109/HiPC.2015.17 ACM Trans. Parallel Comput. tridigpu: A GPU library for block tridiagonal and banded linear equation systems • 111:25 [10] Adrián P. Diéguez, Margarita Amor, and Ramón Doallo. 2019. Parallel preix operations on GPU: tridiagonal system solvers and scan operators. Journal of Supercomputing75, 3 (2019), 1510ś1523. https://doi.org/10.1007/s11227-018-2676-z [11] Adrián P. Diéguez, Margarita Amor, and Ramón Doallo. 2019. Tree partitioning reduction: A new parallel partition method for solving tridiagonal systems. ACM Trans. Math. Software 45, 3 (2019). https://doi.org/10.1145/3328731 [12] Adrián Pérez Diéguez, Margarita Amor, Jacobo Lobeiras, and Ramón Doallo. 2018. Solving large problem sizes of index-digit algorithms on GPU: FFT and tridiagonal system solvers. IEEE Trans. Comput. 67, 1 (2018), 86ś101. https://doi.org/10.1109/TC.2017.2723879 [13] J.W. Eastwood. 1982. Parallel computers: Architecture, programming and algorithms . Vol. 27. 
Bristol. 104 pages. https://doi.org/10.1016/ 0010-4655(82)90016-9 [14] Jennifer B. Erway, Roummel F. Marcia, and Joseph A. Tyson. 2010. Generalized diagonal pivoting methods for tridiagonal systems without interchanges. IAENG International Journal of Applied Mathematics 40, 4 (2010), 1ś7. [15] S Foulon and K.J. in ’t Hout. 2010. Adi Finite Diference Schemes for OptionInternational Pricing. Journal of Numerical Analysis and Modeling7, 2 (2010), 303ś320. [16] Peter Frolkovič. 1990. Numerical recipes: The art of scientiic computing. Acta Applicandae Mathematicae19, 3 (jun 1990), 297ś299. https://doi.org/10.1007/bf01321860 [17] ABHIJIT GHOSH and CHITTARANJAN MISHRA. 2021. A parallel cyclic reduction algorithm for pentadiagonal systems with application to a convection-dominated heston pde. SIAM Journal on Scientiic Computing 43, 2 (jan 2021), C177śC202. https://doi.org/10.1137/ 20M1311053 [18] Mike Giles, Endre Laszlo, Istvan Reguly, Jeremy Appleyard, and Julien Demouth. 2014. GPU Implementation of Finite Diference Solvers. Proceedings of WHPCF 2014: 7th Workshop on High Performance Computational Finance - Held in conjunction with SC 2014: The International Conference for High Performance Computing, Networking, Storage and Analysis (2014), 1ś8. https://doi.org/10.1109/WHPCF.2014.10 [19] Andrew Gloster, Lennon Ó Náraigh, and Khang Ee Pang. 2019. cuPentBatchÐA batched pentadiagonal solver for NVIDIA GPUs. Computer Physics Communications 241 (2019), 113ś121. https://doi.org/10.1016/j.cpc.2019.03.016 [20] Dominik Göddeke and Robert Strzodka. 2011. Cyclic reduction tridiagonal solvers on GPUs applied to mixed-precision IEEE multigrid. Transactions on Parallel and Distributed Systems 23, 1 (2011), 22ś32. https://doi.org/10.1109/TPDS.2010.61 [21] Gaël Guennebaud, Benoît Jacob, and Others. 2012. Eigen v3. http://eigen.tuxfamily.org. http://eigen.tuxfamily.org [22] George R. Halliwell. 2004. Evaluation of vertical coordinate and vertical mixing algorithms in the HYbrid-Coordinate Ocean Model (HYCOM). Ocean Modelling7, 3-4 (2004), 285ś322. https://doi.org/10.1016/j.ocemod.2003.10.002 [23] R. W. Hockney. 1965. A Fast Direct Solution of Poisson’s Equation Using Fourier Analysis. Journal of the ACM (JACM) 12, 1 (1965), 95ś113. https://doi.org/10.1145/321250.321259 [24] Rabia Kamra and S. Chandra Sekhara Rao. 2021. A stable parallel algorithm for block tridiagonal toeplitzśblockśtoeplitz linear systems. Mathematics and Computers in Simulation 190 (2021), 1415ś1440. https://doi.org/10.1016/j.matcom.2021.07.019 [25] Michael Kass, Aaron Lefohn, and John Owens. 2006. Interactive Depth of Field Using Simulated Difusion Computing on a GPU.January 2006 (2006), 1ś8. [26] Michael Kass and Gavin Miller. 1990. Rapid, stable luid dynamics for computer graphics. 24, 4 (1990), 49ś57. https://doi.org/10.1145/ 97879.97884 [27] Hee Seok Kim, Shengzhao Wu, Li Wen Chang, and Wen Mei W. Hwu. 2011. A scalable tridiagonal solver for ProGP ceedings Us. of the International Conference on Parallel Processing(2011), 444ś453. https://doi.org/10.1109/ICPP.2011.41 [28] Ki Ha Kim, Ji Hoon Kang, Xiaomin Pan, and Jung Il Choi. 2021. PaScaL_TDMA: A library of parallel and scalable solvers for massive tridiagonal systems. Computer Physics Communications 260 (2021), 107722. https://doi.org/10.1016/j.cpc.2020.107722 [29] Christoph Klein and Robert Strzodka. [n.d.]. tridigpu. https://mp-force.ziti.uni-heidelberg.de/asc/code/tridigpu [30] Christoph Klein and Robert Strzodka. 2021. 
Tridiagonal GPU Solver with Scaled Partial Pivoting at MaximumABandwidth. CM In International Conference Proceeding Series . ACM, New York, NY, USA, 1ś10. https://doi.org/10.1145/3472456.3472484 [31] Endre László, Mike Giles, and Jeremy Appleyard. 2016. Manycore algorithms for batch scalar and block tridiagonal ACMsolv Trans. ers. Math. Software 42, 4 (2016). https://doi.org/10.1145/2830568 [32] Jingmei Li, Zhigao Zheng, Qiao Tian, Guoyin Zhang, Fangyuan Zheng, and Yuanyuan Pan. 2017. Research on tridiagonal matrix solver design based on a combination of processors. Computers and Electrical Engineering 62 (2017), 1ś16. https://doi.org/10.1016/j. compeleceng.2017.07.014 [33] Duane Merrill. 2021. CUB. https://nvlabs.github.io/cub [34] NVIDIA. 2022. CUSPARSE library. In NVIDIA Corporation, Santa Clara, California . https://docs.nvidia.com/cuda/cusparse [35] Adrian Perez DIeguez, Margarita Amor Lopez, and Ramon Doallo Biempica. 2018. Solving Multiple Tridiagonal Systems on a Multi-GPU Platform.Proceedings - 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2018 (2018), 759ś763. https://doi.org/10.1109/PDP2018.2018.00123 [36] Thomas H. Pulliam. 1986. Implicit solution methods in computational luidApplie dynamics. d Numerical Mathematics 2, 6 (1986), 441ś474. https://doi.org/10.1016/0168-9274(86)90002-4 ACM Trans. Parallel Comput. 111:26 • Christoph Klein and Robert Strzodka [37] L Thomas. 1949. Elliptic problems in linear diferential equations overWaatson netwSci. ork. Comput. Lab Report, Columbia University, New York 1 (1949), 42. [38] I. E. Venetis, A. Kouris, A. Sobczyk, E. Gallopoulos, and A. H. Sameh. 2015. A direct tridiagonal solver based on Givens rotations for GPU architectures.Parallel Comput. 49 (2015), 101ś116. https://doi.org/10.1016/j.parco.2015.03.008 [39] H. H. Wang. 1981. A Parallel Method for Tridiagonal Equations. ACM Transactions on Mathematical Software (TOMS)7, 2 (1981), 170ś183. https://doi.org/10.1145/355945.355947 [40] Xiaojing Wang and Z. G. Mou. 1991. A divide-and-conquer method of solving tridiagonal systems on hypercube massively parallel computers. Proceedings of the 3rd IEEE Symposium on Parallel and Distributed Processing 1991 (1991), 810ś817. https://doi.org/10.1109/ SPDP.1991.218237 [41] Technical Whitepaper. 2010. Semantic Versioning Semantic Versioning. , 10 pages. ACM Trans. Parallel Comput. tridigpu: A GPU library for block tridiagonal and banded linear equation systems • 111:27 APPENDIX A TRIDIAGONAL LIBRARY tridigpu [29] is published under the the 3-Clause BSD License and is intended to be backward compatible at the source level, and follows the semantic versioning guidelines introduced by Tom Preston-Werner [41]. A.1 Naming Conventions We follow the general BLAS naming conventions and provide a C-API tridigpu<t><format><operation>, where <t> can be S, D, C, Z, which represent the typesfloat, double, cuComplex, and cuDoubleComplex, respectively. As a placeholder for the corresponding types we<Tuse >. We use [c] to indicate an optional letter c in the function name, which refers to the version of the function for cyclic systems. A.2 Data Format and Layouts A.2.1 Dense matrices. All dense matrices are saved in column-major order. Therefore, the multiple right-hand sidesX are saved consecutively in memory. E.g. the right-hand sides of Equation 3 are saved as D = [1, 2, 3, 4, 5, 6, 7, 8, ..., 17, 18]. A.2.2 Tridiagonal matrices. 
The tridiagonal system is saved in a banded format, i.e., in three separate buffers a, b, and c, each of length N. Thus, matrix A of Equation 2 is saved as

a = [ 1,  2,  3,  4,  5,  6]
b = [ 7,  8,  9, 10, 11, 12]
c = [13, 14, 15, 16, 17, 18],

whereas a[0] = 0 and c[5] = 0 for the tridiagonal system of Equation 3.

A.2.3 Block tridiagonal matrices. For block tridiagonal systems, each block is saved as a dense matrix in column-major order, and the tridiagonal bands of Equation 4 are saved as

a = [ 0,  0,  0,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12]
b = [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28]
c = [29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,  0,  0,  0,  0].

A.2.4 Banded matrices. We save banded matrices from the lowest to the highest band. Thus, the banded matrix from Equation 5 with kl = ku = 2 is saved as

AB = [ *,  *,  1,  4,  5,  8,  9, 12,
       *, 14,  3, 18,  7, 22, 11, 26,
      13, 16, 17, 20, 21, 24, 25, 28,
      15, 30, 19, 34, 23, 38, 27,  *,
      29, 32, 33, 36, 37, 40,  *,  *].

Analogously to the scalar case in Section A.2.2, the placeholders * contain coefficients in case of a cyclic system and are unused otherwise. The buffer has a length of N(kl + ku + 1), where kl and ku are the numbers of bands below and above the diagonal band, respectively. The diagonal is always included in this format, and the distances of the bands to the diagonal are fixed, e.g., for the lower bands: −kl, −(kl − 1), ..., −1.

A.2.5 Sparse matrices. Sparse matrices are saved in a compressed sparse column (CSC) format with zero-based indexing. The CSC matrix D of Equation 6 is saved as

values      = [1, 5, 10, 11, 2, 12, 3, 8, 13, 6, 9, 4, 7]
row_indices = [0, 1,  5,  6, 0,  6, 0, 3,  6, 1, 3, 0, 1]
col_ptrs    = [0, 4, 6, 9, 11, 12, 13].

A.3 Resource Consumption

RPTS is a hierarchical algorithm, which must save intermediate results during the processing of higher stages. The amount of required temporary memory depends on various parameters like the data type, the partition size, the number of stages in the hierarchy, the size of the blocks, and the number of right-hand sides. When the original tridiagonal equation system size is calculated as N(n_r + 3), a scalar tridiagonal solve in single precision with tridigpuSgtsv requires approximately between 5% and 11% of it as temporary memory, whereas cusparseSgtsv2 [34] requires between 100% and 110% of the tridiagonal system byte size. Moreover, tridigpu does not require any global initialization or finalization functions.

A.4 C API

In this section the function declarations of the C API of tridigpu are presented. Most functions take a parameter to control the pivoting scheme, which is an enum of type tridigpu_pivoting_t with the instances TRIDIGPU_NO_PIVOTING, TRIDIGPU_PARTIAL_PIVOTING, TRIDIGPU_SCALED_PARTIAL_PIVOTING, and TRIDIGPU_DEFAULT_PIVOTING. Many functions require a (temporary) buffer of GPU memory for execution, the size of which is returned by the corresponding tridigpu<t><format><operation>_bufferSizeExt function.
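As a bridge between the data layouts of Section A.2 and the routines listed below, the following self-contained sketch assembles a small scalar tridiagonal system in the layout of Section A.2.2 and solves it with the gtsv routine of Section A.4.1. The header name tridigpu.h, the numeric values, and the minimal error handling are illustrative assumptions and not part of the library specification.

#include <stdio.h>
#include <cuda_runtime.h>
#include "tridigpu.h"   // assumed header name

int main(void)
{
    enum { N = 4, n_r = 1 };
    // Scalar banded layout (Section A.2.2): a[0] and c[N-1] are unused for a non-cyclic system.
    float a[N] = { 0.0f, -1.0f, -1.0f, -1.0f };     // lower diagonal
    float b[N] = { 2.0f,  2.0f,  2.0f,  2.0f };     // main diagonal
    float c[N] = {-1.0f, -1.0f, -1.0f,  0.0f };     // upper diagonal
    float D[N * n_r] = { 1.0f, 0.0f, 0.0f, 1.0f };  // one dense right-hand side, column-major

    float *d_a, *d_b, *d_c, *d_D;
    cudaMalloc((void**)&d_a, sizeof(a)); cudaMemcpy(d_a, a, sizeof(a), cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_b, sizeof(b)); cudaMemcpy(d_b, b, sizeof(b), cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_c, sizeof(c)); cudaMemcpy(d_c, c, sizeof(c), cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_D, sizeof(D)); cudaMemcpy(d_D, D, sizeof(D), cudaMemcpyHostToDevice);

    size_t buffer_size_in_bytes = 0;
    tridigpuSgtsv_bufferSizeExt(TRIDIGPU_DEFAULT_PIVOTING, N, n_r, &buffer_size_in_bytes);
    void* p_buffer;
    cudaMalloc(&p_buffer, buffer_size_in_bytes);

    // In-place solve: d_D is passed as X and as D, so the solution overwrites the right-hand side.
    if (tridigpuSgtsv(0, TRIDIGPU_DEFAULT_PIVOTING, N, n_r, d_a, d_b, d_c, d_D, d_D, p_buffer)) {
        printf("solving tridiagonal system failed\n");
    }
    cudaMemcpy(D, d_D, sizeof(D), cudaMemcpyDeviceToHost);
    printf("x = [%f %f %f %f]\n", D[0], D[1], D[2], D[3]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); cudaFree(d_D); cudaFree(p_buffer);
    return 0;
}

The same pattern applies to all routines below: query the required buffer size, allocate p_buffer once, and pass device pointers for the band and right-hand-side buffers.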
A.4.1 Solve (cyclic) scalar tridiagonal systems with multiple right-hand sides. This function solves Problem Class Cyclic and Problem Class Scalar, which are (cyclic) general tridiagonal systems.

tridigpu_status_t tridigpu<t>[c]gtsv_bufferSizeExt(tridigpu_pivoting_t p, int N, int n_r, size_t* buffer_size_in_bytes)

tridigpu_status_t tridigpu<t>[c]gtsv(cudaStream_t stream, tridigpu_pivoting_t p, int N, int n_r, const <T>* a, const <T>* b, const <T>* c, <T>* X, const <T>* D, void* p_buffer)

The function does not allocate any extra storage, enqueues the calculation into the CUDA stream, and is non-blocking, i.e., it may return control back to the host before the result is ready. For the solution of cyclic systems, cgtsv is provided.

Input       Description
p           recommended value: TRIDIGPU_DEFAULT_PIVOTING (see above)
N           size of the tridiagonal system
n_r         number of right-hand sides
a           lower band of the tridiagonal matrix (length N)
b           diagonal of the tridiagonal matrix (length N)
c           upper band of the tridiagonal matrix (length N)
D           dense right-hand side matrix; buffer of length N * n_r
p_buffer    temporarily required memory buffer on the GPU

Output      Description
X           dense solution matrix; buffer of length N * n_r. Buffer D may also be passed as X; then the solution replaces the right-hand side data after kernel execution.

Example:

size_t buffer_size_in_bytes = 0;
if (tridigpuSgtsv_bufferSizeExt(p, N, n_r, &buffer_size_in_bytes)) {
    printf("determining temporary buffer size failed\n");
}
void* p_buffer;
if (cudaMalloc(&p_buffer, buffer_size_in_bytes)) {
    printf("allocating temporary buffer failed\n");
}
if (tridigpuSgtsv(stream, p, N, n_r, a, b, c, X, D, p_buffer)) {
    printf("solving tridiagonal system failed\n");
}

A.4.2 Solve block tridiagonal systems with multiple right-hand sides. The functions in this section solve Problem Class Block.

tridigpu_status_t tridigpu<t>[c]bgtsv_bufferSizeExt(tridigpu_pivoting_t p, int N, int n_r, int n, size_t* buffer_size_in_bytes)

tridigpu_status_t tridigpu<t>[c]bgtsv(cudaStream_t stream, tridigpu_pivoting_t p, int N, int n_r, int n, const <T>* a, const <T>* b, const <T>* c, <T>* X, const <T>* D, void* p_buffer)

The function does not allocate any extra storage, enqueues the calculation into the CUDA stream, and is non-blocking, i.e., it may return control back to the host before the result is ready. The function is available for block (bgtsv) and cyclic block (cbgtsv) general tridiagonal systems.

Input       Description
p           recommended value: TRIDIGPU_DEFAULT_PIVOTING (see above)
N           size of the block tridiagonal system
n_r         number of right-hand sides
n           size of one block; allowed values are 1, 2, 3, 4
a           lower band of the tridiagonal matrix (length N * n * n)
b           diagonal of the tridiagonal matrix (length N * n * n)
c           upper band of the tridiagonal matrix (length N * n * n)
D           dense right-hand side matrix; buffer of length N * n_r * n
p_buffer    temporarily required memory buffer on the GPU

Output      Description
X           dense solution matrix; buffer of length N * n_r * n. Buffer D may also be passed as X; then the solution replaces the right-hand side data after kernel execution.

Example:

size_t buffer_size_in_bytes = 0;
if (tridigpuSbgtsv_bufferSizeExt(p, N, n_r, n, &buffer_size_in_bytes)) {
    printf("determining temporary buffer size failed\n");
}
void* p_buffer;
if (cudaMalloc(&p_buffer, buffer_size_in_bytes)) {
    printf("allocating temporary buffer failed\n");
}
if (tridigpuSbgtsv(stream, p, N, n_r, n, a, b, c, X, D, p_buffer)) {
    printf("solving block tridiagonal system failed\n");
}
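To make the block layout of Section A.2.3 concrete, the sketch below shows the host-side band buffers for a block tridiagonal system with N = 3 block rows and 2x2 blocks, as they would be uploaded and passed to tridigpuSbgtsv in the example above. The numeric values are illustrative assumptions; only the layout rules follow Section A.2.3.

// Block layout (Section A.2.3): every band stores N blocks of n*n values, each block
// in column-major order. For a non-cyclic system the first block of a and the last
// block of c are unused and set to zero.
enum { N = 3, n = 2, n_r = 1 };

//                      block 0          block 1          block 2
float a[N * n * n] = {  0, 0, 0, 0,     -1, 0, 0, -1,    -1, 0, 0, -1 };  // lower block band
float b[N * n * n] = {  4, 1, 1, 4,      4, 1, 1, 4,      4, 1, 1, 4 };   // diagonal block band
float c[N * n * n] = { -1, 0, 0, -1,    -1, 0, 0, -1,     0, 0, 0, 0 };   // upper block band

// Dense right-hand side of length N * n_r * n, stored column-major.
float D[N * n_r * n] = { 1, 2, 3, 4, 5, 6 };

// Because each block is itself column-major, the four values {4, 1, 1, 4} of a diagonal
// block represent the 2x2 matrix with rows (4 1) and (1 4).

After copying a, b, c, and D to the GPU, the call tridigpuSbgtsv(stream, p, N, n_r, n, d_a, d_b, d_c, d_X, d_D, p_buffer) from the example above solves this system.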
Function bgtsv factorizes the matrix with every function call, whereas bgtf calculates the factorization explicitly once, and bgtfs uses it to compute the solution from a given right-hand side. The explicit factorization is more efficient if the kernels of tridigpu are not memory bound any more.

tridigpu_status_t tridigpu<t>bgtf_factorizationBufferSizeExt(tridigpu_pivoting_t p, int N, int n, size_t* factorization_buffer_size_in_bytes)

tridigpu_status_t tridigpu<t>bgtf(cudaStream_t stream, tridigpu_pivoting_t p, int N, int n, <T>* a, <T>* b, <T>* c, void* p_factorization_buffer)

The function does not allocate any extra storage, enqueues the calculation into the CUDA stream, and is non-blocking, i.e., it may return control back to the host before the result is ready.

Input       Description
p           recommended value: TRIDIGPU_DEFAULT_PIVOTING (see above)
N           size of the block tridiagonal system
n           size of one block; allowed values are 1, 2, 3, 4
a           lower band of the tridiagonal matrix (length N * n * n)
b           diagonal of the tridiagonal matrix (length N * n * n)
c           upper band of the tridiagonal matrix (length N * n * n)

Output      Description
a           lower band of the factorized tridiagonal matrix (length N * n * n)
b           diagonal of the factorized tridiagonal matrix (length N * n * n)
c           upper band of the factorized tridiagonal matrix (length N * n * n)
p_factorization_buffer    buffer to save additional data required for the factorization of the matrix

The tridiagonal system is solved for a given factorization and a corresponding right-hand side with bgtfs.

tridigpu_status_t tridigpu<t>bgtfs_bufferSizeExt(tridigpu_pivoting_t p, int N, int n_r, int n, size_t* buffer_size_in_bytes)

tridigpu_status_t tridigpu<t>bgtfs(cudaStream_t stream, tridigpu_pivoting_t p, int N, int n_r, int n, const <T>* a, const <T>* b, const <T>* c, <T>* X, const <T>* D, const void* p_factorization_buffer, void* p_buffer)

The function does not allocate any extra storage, enqueues the calculation into the CUDA stream, and is non-blocking, i.e., it may return control back to the host before the result is ready. Analogously, LAPACK provides gttrf and gttrs.

Input       Description
p           recommended value: TRIDIGPU_DEFAULT_PIVOTING (see above)
N           size of the block tridiagonal system
n_r         number of right-hand sides
n           size of one block; allowed values are 1, 2, 3, 4
a           lower band of the factorized tridiagonal matrix (length N * n * n)
b           diagonal of the factorized tridiagonal matrix (length N * n * n)
c           upper band of the factorized tridiagonal matrix (length N * n * n)
D           dense right-hand side matrix; buffer of length N * n_r * n
p_factorization_buffer    buffer with additional data of the factorized matrix
p_buffer    temporarily required memory buffer on the GPU

Output      Description
X           dense solution matrix; buffer of length N * n_r * n. Buffer D may also be passed as X; then the solution replaces the right-hand side data after kernel execution.
Example:

size_t factorization_buffer_size_in_bytes = 0;
if (tridigpuSbgtf_factorizationBufferSizeExt(p, N, n, &factorization_buffer_size_in_bytes)) {
    printf("determining factorization buffer size failed\n");
}
void* p_factorization_buffer;
if (cudaMalloc(&p_factorization_buffer, factorization_buffer_size_in_bytes)) {
    printf("allocating factorization buffer failed\n");
}
if (tridigpuSbgtf(stream, p, N, n, a, b, c, p_factorization_buffer)) {
    printf("factorizing block tridiagonal system failed\n");
}
size_t buffer_size_in_bytes = 0;
if (tridigpuSbgtfs_bufferSizeExt(p, N, n_r, n, &buffer_size_in_bytes)) {
    printf("determining temporary buffer size failed\n");
}
void* p_buffer;
if (cudaMalloc(&p_buffer, buffer_size_in_bytes)) {
    printf("allocating temporary buffer failed\n");
}
while (...) {
    // - set right-hand sides `D`
    // - `p_buffer` does not need to be initialized before the call to `tridigpuSbgtfs`
    // - `p_factorization_buffer`, `a`, `b`, `c` are written by `tridigpuSbgtf` and read by `tridigpuSbgtfs`
    if (tridigpuSbgtfs(stream, p, N, n_r, n, a, b, c, X, D, p_factorization_buffer, p_buffer)) {
        printf("solving block tridiagonal system with factorization failed\n");
    }
}

A.4.3 Solve banded systems with multiple right-hand sides. This function solves Problem Class DIA.

tridigpu_status_t tridigpu<t>gbsv_bufferSizeExt(tridigpu_pivoting_t p, int N, int n_r, int kl, int ku, size_t* buffer_size_in_bytes)

tridigpu_status_t tridigpu<t>gbsv(cudaStream_t stream, tridigpu_pivoting_t p, int N, int n_r, int kl, int ku, const <T>* AB, <T>* X, const <T>* D, void* p_buffer)

The function does not allocate any extra storage, enqueues the calculation into the CUDA stream, and is non-blocking, i.e., it may return control back to the host before the result is ready. The numeric behaviour of this function is equal to tridigpu<t>bgtsv, since the banded system is converted on-chip into a block tridiagonal system.

Input       Description
p           recommended value: TRIDIGPU_DEFAULT_PIVOTING (see above)
N           size of the banded system
n_r         number of right-hand sides
kl          number of lower bands in AB; kl = 1, 2, 3, 4
ku          number of upper bands in AB; ku = 1, 2, 3, 4
AB          banded storage format of the matrix; buffer of length N * (kl + ku + 1)
D           dense right-hand side matrix; buffer of length N * n_r
p_buffer    temporarily required memory buffer on the GPU

Output      Description
X           dense solution matrix; buffer of length N * n_r. Buffer D may also be passed as X; then the solution replaces the right-hand side data after kernel execution.

Example:

size_t buffer_size_in_bytes = 0;
if (tridigpuSgbsv_bufferSizeExt(p, N, n_r, kl, ku, &buffer_size_in_bytes)) {
    printf("determining temporary buffer size failed\n");
}
void* p_buffer;
if (cudaMalloc(&p_buffer, buffer_size_in_bytes)) {
    printf("allocating temporary buffer failed\n");
}
if (tridigpuSgbsv(stream, p, N, n_r, kl, ku, AB, X, D, p_buffer)) {
    printf("solving banded system failed\n");
}
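For the special case kl = ku = 1 the banded storage of Section A.2.4 contains exactly the three scalar buffers of Section A.2.2, concatenated from the lowest to the highest band, so existing tridiagonal data can be reused with gbsv. The sketch below assumes the row-aligned band ordering implied by the layout example in Section A.2.4; the numeric values are illustrative.

// Banded storage (Section A.2.4) of a tridiagonal matrix, i.e. kl = ku = 1:
// the bands are stored from lowest to highest, each of length N, so AB is the
// concatenation [a | b | c] of the scalar buffers from Section A.2.2.
enum { N = 4, KL = 1, KU = 1 };

float AB[N * (KL + KU + 1)] = {
     0.0f, -1.0f, -1.0f, -1.0f,   // lower band; first entry unused for a non-cyclic system
     2.0f,  2.0f,  2.0f,  2.0f,   // main diagonal
    -1.0f, -1.0f, -1.0f,  0.0f    // upper band; last entry unused for a non-cyclic system
};

// After uploading AB and a dense right-hand side D to the GPU, the call
//   tridigpuSgbsv(stream, p, N, n_r, KL, KU, d_AB, d_X, d_D, p_buffer)
// solves this tridiagonal system in banded form.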
A.4.4 Solve scalar tridiagonal systems with multiple sparse right-hand sides and dense solutions. This function solves Problem Class ScalarCSC2Dense for D saved in a CSC format.

tridigpu_status_t tridigpu<t>gtsv_csc2dense_bufferSizeExt(tridigpu_pivoting_t p, int N, int n_r, size_t* buffer_size_in_bytes)

tridigpu_status_t tridigpu<t>gtsv_csc2dense(cudaStream_t stream, tridigpu_pivoting_t p, int N, int n_r, const <T>* a, const <T>* b, const <T>* c, <T>* X, const <T>* rhs_values, const int* rhs_row_indices, const int* rhs_col_ptrs, int rhs_nnz, void* p_buffer)

The function does not allocate any extra storage, enqueues the calculation into the CUDA stream, and is non-blocking, i.e., it may return control back to the host before the result is ready.

Input       Description
p           recommended value: TRIDIGPU_DEFAULT_PIVOTING (see above)
N           size of the tridiagonal system
n_r         number of right-hand sides
a           lower band of the tridiagonal matrix (length N)
b           diagonal of the tridiagonal matrix (length N)
c           upper band of the tridiagonal matrix (length N)
rhs_values       value array of the sparse right-hand sides in CSC format (length rhs_nnz)
rhs_row_indices  row index array of the sparse right-hand sides in CSC format (length rhs_nnz)
rhs_col_ptrs     column pointer array of the sparse right-hand sides in CSC format (length n_r + 1)
rhs_nnz     number of non-zero entries of the sparse right-hand side matrix D
p_buffer    temporarily required memory buffer on the GPU

Output      Description
X           dense solution matrix; buffer of length N * n_r

Example:

size_t buffer_size_in_bytes = 0;
if (tridigpuSgtsv_csc2dense_bufferSizeExt(p, N, n_r, &buffer_size_in_bytes)) {
    printf("determining temporary buffer size failed\n");
}
void* p_buffer;
if (cudaMalloc(&p_buffer, buffer_size_in_bytes)) {
    printf("allocating temporary buffer failed\n");
}
if (tridigpuSgtsv_csc2dense(stream, p, N, n_r, a, b, c, X, rhs_values, rhs_row_indices, rhs_col_ptrs, rhs_nnz, p_buffer)) {
    printf("solving tridiagonal system with sparse rhs failed\n");
}

A.4.5 General tridiagonal solve for many sparse right-hand sides and sparse solutions. This function solves Problem Class ScalarCSC2CSC, i.e., it calculates $A^{-1}D$ for a general tridiagonal matrix $A$ and a sparse matrix $D$. For large sparse matrices $D$ the dense solution matrix would exceed the available memory, therefore, this function outputs only a sparse result matrix $X$. In fact, $X = \mathrm{prune}(A^{-1}D, S_p)$, thus the result is equal to the dense result pruned to a sparsity pattern $S_p$. $S_p$ is determined such that the maximum absolute values in the dense result are included in the sparse result. If $\mathrm{nnz}_j$ is the number of non-zero entries in column $j$ of matrix $X$, and for each $j$ let $\sigma_j : [0, N-1] \to [0, N-1]$ be an index permutation such that the $j$-th column $(x_{\sigma_j(i),j})_i$ of $X := A^{-1}D$ is sorted in descending order with respect to the absolute values, then $S_p$ is obtained by

    (S_p)_{i,j} = \begin{cases} 1 & \text{if } \sigma_j(i) < \mathrm{nnz}_j \\ 0 & \text{otherwise.} \end{cases}    (16)

In words: the sparsity pattern of $X$ is controlled by the user-given column pointers ($\mathrm{nnz}_j$), and each column of the sparse result is filled with the maximum values of the dense result with respect to their absolute value. For example, if a dense result column were $(0.1, -3.0, 0.5, 2.0)^T$ and $\mathrm{nnz}_j = 2$, the pruned column would contain only the entries $-3.0$ and $2.0$.
tridigpu_status_t tridigpu<t>gtsv_csc2csc(tridigpu_pivoting_t p, int N, const <T>* a, const <T>* b, const <T>* c, const <T>* rhs_values, const int* rhs_row_indices, const int* rhs_col_ptrs, int n_r, int rhs_nnz, <T>* result_values, int* result_row_indices, const int* result_col_ptrs, int result_nnz)

The function allocates extra GPU storage, enqueues the calculation into the default CUDA stream, and blocks the host, i.e., it returns control back to the host only when the result is ready.

Input       Description
p           recommended value: TRIDIGPU_DEFAULT_PIVOTING (see above)
N           size of the tridiagonal system
a           lower band of the tridiagonal matrix (length N)
b           diagonal of the tridiagonal matrix (length N)
c           upper band of the tridiagonal matrix (length N)
rhs_values       CSC values of the right-hand sides D (length rhs_nnz)
rhs_row_indices  CSC row indices of the right-hand sides D (length rhs_nnz)
rhs_col_ptrs     CSC column pointers of the right-hand sides D (length N + 1)
n_r         number of right-hand sides
rhs_nnz     number of non-zero entries of the right-hand sides D
result_col_ptrs  CSC column pointers of the result matrix (length N + 1)
result_nnz  number of non-zero entries of the result matrix

Output      Description
result_values       CSC values of the sparse solution matrix X (length result_nnz)
result_row_indices  CSC row indices of the sparse solution matrix X (length result_nnz)
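The sections above give no usage example for gtsv_csc2csc, so the following self-contained sketch illustrates a typical use: computing a pruned approximate inverse of a tridiagonal matrix by passing the identity as sparse right-hand side and keeping the two largest entries of every result column. The header name tridigpu.h, the to_device helper, and all numeric values are illustrative assumptions; device pointers are passed in analogy to the other routines.

#include <stdio.h>
#include <cuda_runtime.h>
#include "tridigpu.h"   // assumed header name

// Illustration helper: copy a host buffer into a freshly allocated device buffer.
static void* to_device(const void* host, size_t bytes)
{
    void* dev = NULL;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    return dev;
}

int main(void)
{
    enum { N = 4, n_r = 4, rhs_nnz = 4, keep = 2, result_nnz = keep * n_r };

    // Tridiagonal matrix A in the scalar layout of Section A.2.2.
    float a[N] = { 0.0f, -1.0f, -1.0f, -1.0f };
    float b[N] = { 2.0f,  2.0f,  2.0f,  2.0f };
    float c[N] = {-1.0f, -1.0f, -1.0f,  0.0f };

    // D = 4x4 identity in CSC format with zero-based indexing (Section A.2.5).
    float rhs_values[rhs_nnz]      = { 1.0f, 1.0f, 1.0f, 1.0f };
    int   rhs_row_indices[rhs_nnz] = { 0, 1, 2, 3 };
    int   rhs_col_ptrs[n_r + 1]    = { 0, 1, 2, 3, 4 };

    // Keep the `keep` entries with the largest absolute value in every column of A^{-1} D.
    int result_col_ptrs[n_r + 1] = { 0, 2, 4, 6, 8 };

    float* d_a  = (float*)to_device(a, sizeof(a));
    float* d_b  = (float*)to_device(b, sizeof(b));
    float* d_c  = (float*)to_device(c, sizeof(c));
    float* d_rv = (float*)to_device(rhs_values, sizeof(rhs_values));
    int*   d_ri = (int*)to_device(rhs_row_indices, sizeof(rhs_row_indices));
    int*   d_rc = (int*)to_device(rhs_col_ptrs, sizeof(rhs_col_ptrs));
    int*   d_oc = (int*)to_device(result_col_ptrs, sizeof(result_col_ptrs));

    float* d_ov = NULL;   // output: CSC values of the sparse result X
    int*   d_oi = NULL;   // output: CSC row indices of the sparse result X
    cudaMalloc((void**)&d_ov, result_nnz * sizeof(float));
    cudaMalloc((void**)&d_oi, result_nnz * sizeof(int));

    // Blocking call on the default stream; the routine allocates its own temporary GPU storage.
    if (tridigpuSgtsv_csc2csc(TRIDIGPU_DEFAULT_PIVOTING, N, d_a, d_b, d_c,
                              d_rv, d_ri, d_rc, n_r, rhs_nnz,
                              d_ov, d_oi, d_oc, result_nnz)) {
        printf("solving tridiagonal system with sparse result failed\n");
    }
    return 0;
}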
