
Cholesky method



Main authors: Igor Konshin


1 The Cholesky decomposition (or the square-root method): a basic dot version for dense real symmetric positive definite matrices

1.1 The [math]LL^T[/math] decomposition

The Cholesky decomposition (or the Cholesky factorization) is a decomposition of a symmetric positive definite matrix [math]A[/math] into the product [math]A = LL^T[/math], where the factor [math]L[/math] is a lower triangular matrix with strictly positive diagonal elements. In some cases it is convenient to rewrite this decomposition in its equivalent form [math]A = U^TU[/math], where [math]U = L^T[/math] is an upper triangular matrix. For any symmetric positive definite matrix, the Cholesky decomposition exists and is unique, provided that the diagonal elements of the matrix [math]L[/math] are required to be strictly positive.

The following formulas should be used to find the elements of the matrix [math]L[/math] starting with the upper left-hand corner of the original matrix [math]A[/math]:

[math] \begin{align} \ell_{ii} & = \sqrt{a_{ii} - \sum_{k=1}^{i-1} \ell_{ik}^2}, \\ \ell_{ij} & = \frac{1}{\ell_{jj}} \left(a_{ij} - \sum_{k=1}^{j-1} \ell_{ik} \ell_{jk} \right), \quad j \lt i. \end{align} [/math]

The above expression under the square root sign is always positive if the matrix [math]A[/math] is real symmetric and positive definite.

The computations are performed from the top down and from left to right: for each row, the values [math]\ell_{ij}[/math] ([math]j \lt i[/math]) are determined first and, then, the diagonal value [math]\ell_{ii}[/math] is found. The following computational algorithms are usually employed:

  • the Cholesky–Banachiewicz algorithm (or the Cholesky algorithm), in which the computations start from the upper left-hand corner of the matrix [math]L[/math] and proceed in a row-wise manner; this version of the decomposition is the most commonly used, especially when the row-wise format of matrix storage is chosen for the matrix [math]L[/math];
  • the Crout version of the Cholesky algorithm (or the Cholesky–Crout algorithm), in which the computations also start from the upper left-hand corner of the matrix [math]L[/math] but proceed in a column-wise order; this version is used more rarely and is employed when the column-wise format of matrix storage is chosen for the matrix [math]L[/math] and when the pivot elements should be modified in approximate decompositions.

Both these algorithms can be used when the factor [math]L[/math] is stored in the lower triangular part of the original matrix [math]A[/math].
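
To make the row-wise scheme concrete, below is a minimal sketch of the dot version in Python with NumPy; the function name and the test matrix are illustrative assumptions and not part of the original description.

 import numpy as np

 def cholesky_banachiewicz(A):
     # Row-wise (Cholesky-Banachiewicz) dot version: A = L L^T,
     # A is a dense real symmetric positive definite matrix.
     n = A.shape[0]
     L = np.zeros_like(A, dtype=float)
     for i in range(n):                      # rows of L, top to bottom
         for j in range(i):                  # off-diagonal entries l_ij, j < i
             s = np.dot(L[i, :j], L[j, :j])  # sum_k l_ik * l_jk
             L[i, j] = (A[i, j] - s) / L[j, j]
         L[i, i] = np.sqrt(A[i, i] - np.dot(L[i, :i], L[i, :i]))
     return L

 # a quick check against the built-in factorization
 A = np.array([[4.0, 2.0, 2.0],
               [2.0, 5.0, 3.0],
               [2.0, 3.0, 6.0]])
 L = cholesky_banachiewicz(A)
 assert np.allclose(L @ L.T, A)
 assert np.allclose(L, np.linalg.cholesky(A))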

A basic dot version of the Cholesky algorithm for dense real symmetric positive definite matrices is extensively analyzed in The Cholesky decomposition (the square root method).

1.2 The [math]LDL^T[/math] decomposition

In a number of cases, it is more convenient to use the [math]LDL^T[/math] version of the symmetric triangular decomposition when the matrix [math]L[/math] is lower unitriangular (its diagonal elements are equal to 1) and [math]D[/math] is a diagonal matrix with positive elements. This version of decomposition can easily be related to the above [math]LL^T[/math] version:

[math]A = LDL^T = LD^{1/2}D^{1/2}L^T = (LD^{1/2})\,(LD^{1/2})^T = \tilde L \tilde L^T[/math]

and with the [math]LU[/math] decomposition:

[math]A = LDL^T = L(DL^T) = LU[/math].
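
The relation between the two decompositions can be verified directly; below is a small Python/NumPy sketch in which the test matrix is an illustrative assumption.

 import numpy as np

 A = np.array([[4.0, 2.0, 2.0],
               [2.0, 5.0, 3.0],
               [2.0, 3.0, 6.0]])

 Lt = np.linalg.cholesky(A)            # A = Lt Lt^T, the tilde-L factor above
 d = np.diag(Lt)                       # d = diag(D^{1/2})
 L = Lt / d                            # unit lower triangular factor L
 D = np.diag(d**2)                     # diagonal matrix with positive entries

 assert np.allclose(L @ D @ L.T, A)    # the LDL^T form
 assert np.allclose(L @ (D @ L.T), A)  # L (D L^T) = L U, the link with LU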

2 The Cholesky decomposition: a block version for dense real symmetric positive definite matrices

A block version of the Cholesky decomposition can also be considered. Let [math]n=MN[/math]. Then, the original matrix [math]A[/math] of order [math]n[/math] can be represented as a block matrix of order [math]N[/math] with blocks of order [math]M[/math]. All the formulas used in the dot version of the Cholesky decomposition remain almost the same for the block representation of the matrix [math]A[/math]. Instead of inverting the diagonal blocks, it is more efficient to store these blocks in their factored form [math]D_{ii}=L_{ii}L^T_{ii}[/math] and to solve the corresponding linear triangular systems. The total number of arithmetic operations remains almost the same, but the locality of the algorithm increases considerably. The block of order [math]M[/math] should be chosen in such a way that all the blocks used during elimination can be stored in the first-level or second-level cache. Hence, the data swapping becomes minimal.
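
A hedged sketch of such a block (right-looking) version in Python with NumPy/SciPy is given below: the diagonal blocks are factored and the corresponding triangular systems are solved instead of inverting the blocks. The routine name, the block size, and the use of scipy.linalg.solve_triangular are illustrative assumptions.

 import numpy as np
 from scipy.linalg import solve_triangular

 def block_cholesky(A, M):
     # Right-looking block Cholesky for a dense SPD matrix of order n = N*M.
     n = A.shape[0]
     A = A.copy()
     for k in range(0, n, M):
         kk = slice(k, k + M)
         A[kk, kk] = np.linalg.cholesky(A[kk, kk])   # factor the diagonal block
         if k + M < n:
             rest = slice(k + M, n)
             # panel: solve L_ik L_kk^T = A_ik instead of inverting L_kk
             A[rest, kk] = solve_triangular(A[kk, kk], A[rest, kk].T, lower=True).T
             # trailing update: A_ij <- A_ij - L_ik L_jk^T
             A[rest, rest] -= A[rest, kk] @ A[rest, kk].T
     return np.tril(A)

 rng = np.random.default_rng(0)
 X = rng.standard_normal((8, 8))
 A0 = X @ X.T + 8 * np.eye(8)          # an SPD test matrix, blocks of order M = 4
 L = block_cholesky(A0, 4)
 assert np.allclose(L @ L.T, A0)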

A similar technique is also useful for an efficient implementation of a parallel version of the Cholesky decomposition, which allows one to minimize the total number of data exchanges and the amount of data transferred among the processors. Another advantage of employing the block version of the Cholesky decomposition is that the scalar efficiency of the algorithm can be increased by using the loop unrolling technique for the inner loops over the block order [math]M[/math].

3 The Cholesky decomposition: a dot version for real sparse symmetric positive definite matrices

If the original matrix [math]A[/math] is sparse, it is reasonable to take into account this property to reduce the storage and computational cost.

3.1 Major distinctions from the case of dense matrices

Here we consider the following sparsity types that characterize the ways of matrix storage.

1. Band. If the nonzero elements of a symmetric matrix are concentrated inside a band of width [math]2d+1[/math], then this matrix is called a symmetric band matrix. More precisely, a symmetric matrix [math]A[/math] is said to be a symmetric band matrix if [math]a_{ij}=0[/math] for [math]|i-j|\gt d[/math]. During the Cholesky decomposition, new nonzero elements that originally were zero may appear only inside this band, since no pivoting is done because of positive definiteness. In the original matrix [math]A[/math] and in the lower triangular factor [math]L[/math], the number of nonzero elements is about [math](d+1)n[/math], whereas the number of arithmetic operations is about [math]d^2n[/math]. Note that the situation when an original zero element becomes nonzero during elimination is also known as the generation of fill-in elements (or the generation of fill-ins).

2. Profile. In a more general case, the generation of fill-ins in each row of the lower triangular factor [math]L[/math] is specified by the position of the first nonzero element. The sum of the distances from the first elements of all rows to the main diagonal forms the "profile" of a matrix. In other words, the profile of a symmetric matrix is a measure of how close its elements are to the main diagonal and specifies an upper bound for the number of nonzero elements in the factor [math]L[/math].

3. General sparsity pattern. An upper bound for the number of nonzero elements in the factor [math]L[/math] is the value of its "profile"; however, the consideration of the peculiarities in the structure of nonzero elements inside the profile may sometimes increase the performance efficiency.

In the case of general sparse matrices, it is necessary to choose a storage format for matrix elements. As an example, we mention the "compressed sparse row" (CSR) format. In this format, the nonzero elements are stored row by row in the first (real) array; the second (integer) array contains the column indices of the nonzero elements stored in the first array; and the third (integer) array contains the pointers (or indices) to the beginning of each row in the first and second arrays. By nnz ("number of nonzeros") we denote the total number of the nonzero elements. Then, in double precision, the amount of storage required by the CSR format is equal to [math]3\,{\rm nnz}+n+1[/math] integer words (each double precision value counted as two integer words), where [math]n[/math] is the number of rows in the matrix. In general, the amount of arithmetic operations cannot be estimated in advance, since it depends not only on the number of nonzero elements but also, essentially, on the sparsity structure of the matrix.
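
A small sketch of the CSR storage in Python, assuming SciPy's csr_matrix; the storage estimate below counts integer words with one double precision value counted as two words, which is our reading of the formula above.

 import numpy as np
 from scipy.sparse import csr_matrix

 A = np.array([[4.0, 1.0, 0.0, 0.0],
               [1.0, 5.0, 0.0, 2.0],
               [0.0, 0.0, 3.0, 0.0],
               [0.0, 2.0, 0.0, 6.0]])
 S = csr_matrix(A)

 print(S.data)      # first (real) array: nonzero values stored row by row
 print(S.indices)   # second (integer) array: column indices of those values
 print(S.indptr)    # third (integer) array: beginning of each row, length n + 1

 nnz, n = S.nnz, S.shape[0]
 words = 2 * nnz + nnz + (n + 1)   # 3*nnz + n + 1 integer words of storage
 print(nnz, words)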

In order to implement the Cholesky decomposition in this case, the following operations should be used for sparse rows.

  • copying a sparse row to another one (or to a working "dense" vector, the data unpacking operation);
  • elimination operation for one of the row elements;
  • insertion of a new nonzero element (known also as a "fill-in" element) into a row;
  • data compression with copying a working dense vector to a compressed sparse vector (the data packing operation).
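
The sketch below illustrates the unpacking and packing operations from the list above on a single compressed row; all names and values are illustrative.

 import numpy as np

 n = 6
 vals = np.array([4.0, -1.0, 2.5])   # nonzero values of one sparse row
 cols = np.array([0, 2, 5])          # their column indices

 work = np.zeros(n)                  # unpacking: scatter into a dense work vector
 work[cols] = vals

 work[3] += 0.7                      # an elimination step creates a fill-in in column 3

 new_cols = np.nonzero(work)[0]      # packing: gather back into compressed form
 new_vals = work[new_cols]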

3.2 Reordering to reduce the number of fill-in elements

The structure of the lower triangular factor [math]L[/math] and the amount of the required storage are dependent on the ordering of the rows and columns of the original matrix [math]A[/math]. There exist algorithms minimizing the number of fill-ins in the matrix [math]L[/math]. Here we mention the following algorithms of such a type.

  • The RCM (Reverse Cuthill–McKee) algorithm to reduce the profile of a sparse symmetric matrix (or its bandwidth). In addition to the profile reduction, it is possible to reduce the number of fill-ins in the factor [math]L[/math]. This algorithm is widely used in practice but is not the most efficient.
  • The ND (Nested Dissection) algorithm is intended to minimize the number of fill-ins in the factor [math]L[/math]. Its asymptotic optimality can be proved for some particular cases.

In the general case, the problem of searching for the permutation that minimizes the number of fill-ins in the factor [math]L[/math] can be considered as an NP-complete problem.
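
As an illustration of the profile-reducing reordering, the sketch below (assuming SciPy's reverse_cuthill_mckee) scrambles a chain-structured (tridiagonal) matrix and compares the bandwidths before and after applying the RCM permutation.

 import numpy as np
 from scipy.sparse import csr_matrix
 from scipy.sparse.csgraph import reverse_cuthill_mckee

 rng = np.random.default_rng(1)
 n = 10
 order = rng.permutation(n)              # a chain whose vertices were scrambled
 A = np.eye(n)
 for k in range(n - 1):
     i, j = order[k], order[k + 1]
     A[i, j] = A[j, i] = 1.0
 S = csr_matrix(A)

 perm = reverse_cuthill_mckee(S, symmetric_mode=True)
 P = S[perm, :][:, perm]                 # symmetrically permuted matrix

 def bandwidth(M):
     i, j = M.nonzero()
     return int(np.max(np.abs(i - j)))

 print(bandwidth(S), bandwidth(P))       # bandwidth before and after reordering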

4 The Cholesky decomposition: a block version for sparse real symmetric positive definite matrices

In a number of cases, it is convenient to represent a sparse symmetric matrix in a block form with blocks of small order [math]M[/math]. For example, the value of [math]M[/math] can be chosen to be equal to the number of unknown functions per node when using various finite element or finite difference approximations of partial differential equations. Only the block sparsity structure is stored in computer memory, which allows one to save storage for the integer arrays. Let nnz be the number of nonzero blocks of order [math]M[/math]. Then, the amount of memory storage required for such small-block matrices is equal to [math](2M^2+1)\,{\rm nnz}+n/M+1[/math] if the CSR format and the double precision mode are used.

In some cases the value of the block order [math]M[/math] can be chosen for other reasons (for example, to enable the loop unrolling technique and thus increase the performance of low-level procedures).

Other algorithms devoted to the Cholesky decomposition for the matrices considered in the above discussion can be developed on the basis of the above approaches for block and sparse matrices.

5 The Cholesky decomposition for symmetric indefinite (saddle point) matrices

If a symmetric matrix [math]{\mathcal A}[/math] is represented as

[math] {\mathcal A} = \begin{bmatrix} A & B^T \\ B & -C \end{bmatrix}, [/math]

where [math]A[/math] is a symmetric positive definite matrix ([math]A=A^T\gt 0[/math]) and [math]C[/math] is a symmetric nonnegative definite matrix ([math]C=C^T\ge0[/math]), then the matrix [math]{\mathcal A}[/math] can be transformed to the following form by performing a single step of block pivoting:

[math] \begin{bmatrix} A & 0 \\ 0 & S \end{bmatrix}. [/math]

Here the Schur complement matrix [math]S=-(C+BA^{-1}B^T)[/math] is a strictly negative definite matrix ([math]S=S^T\lt 0[/math]). This means that the matrix [math]{\mathcal A}[/math] has [math]n_A[/math] positive and [math]n_C[/math] negative eigenvalues, where [math]n_A[/math] and [math]n_C[/math] are the orders of the matrices [math]A[/math] and [math]C[/math], respectively.

In this case there exists a symmetric triangular decomposition [math]{\mathcal A}={\mathcal L}D{\mathcal L}^T[/math] such that [math]{\mathcal L}[/math] is a lower unitriangular matrix and the diagonal matrix [math]D[/math] consists of [math]n_A[/math] positive and [math]n_C[/math] negative elements. Such a decomposition can be obtained without pivoting, even if [math]C[/math] is a zero matrix.
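
A hedged sketch of this [math]LDL^T[/math] decomposition without pivoting for a small saddle-point matrix is given below (the matrices and the helper routine are illustrative assumptions); the diagonal factor obtained indeed contains [math]n_A[/math] positive and [math]n_C[/math] negative entries.

 import numpy as np

 def ldlt(A):
     # L D L^T decomposition without pivoting, L is unit lower triangular.
     n = A.shape[0]
     L = np.eye(n)
     d = np.zeros(n)
     for j in range(n):
         d[j] = A[j, j] - np.dot(L[j, :j]**2, d[:j])
         for i in range(j + 1, n):
             L[i, j] = (A[i, j] - np.dot(L[i, :j] * L[j, :j], d[:j])) / d[j]
     return L, d

 Ablk = np.array([[4.0, 1.0], [1.0, 3.0]])   # n_A = 2, symmetric positive definite
 B    = np.array([[1.0, 2.0]])               # 1 x 2 coupling block
 C    = np.array([[0.5]])                    # n_C = 1, nonnegative definite
 M = np.block([[Ablk, B.T], [B, -C]])

 L, d = ldlt(M)
 assert np.allclose(L @ np.diag(d) @ L.T, M)
 print(d)                                    # two positive and one negative entry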

In the general case of a symmetric nonsingular indefinite matrix, its decomposition should be performed by diagonal pivoting, which corresponds to a symmetric permutation of the rows and columns in the original matrix [math]{\mathcal A}[/math].

6 The Cholesky decomposition for Hermitian matrices

A Hermitian matrix is a square complex matrix [math]A[/math] such that the following equalities are valid for its elements: [math]a_{ij}=\overline{a_{ji}}[/math]; here we use the complex conjugation operation defined as follows: if [math]z=a+{\rm i\,}b\,[/math] and [math]{\rm i}^2=-1[/math], then [math]\overline z=a-{\rm i\,}b\,[/math]. In the matrix form, the above equalities can be written as [math]A=\overline{A^T}[/math] or [math]A=A^*=A^{\rm H}[/math]. Thus, a Hermitian matrix is equal to its conjugate transpose. Its diagonal elements are real.

6.1 A dot version

A dot version of the Cholesky decomposition for real symmetric positive definite matrices can be generalized to the case of Hermitian positive definite matrices. The above formulas derived for the real case remain the same with the only exception that the complex operations are used instead of the corresponding operations with real numbers:

[math] \begin{align} L_{ii} & = \sqrt{ A_{ii} - \sum_{k=1}^{i-1} L_{ik}L_{ik}^* }, \\ L_{ij} & = \frac{1}{L_{jj}} \left( A_{ij} - \sum_{k=1}^{j-1} L_{ik} L_{jk}^* \right), \quad j \lt i. \end{align} [/math]

Another distinction from the real version is that the complex operations require reading twice as much data from memory while performing about four times as many arithmetic operations, which increases the locality of the computations and the overall performance efficiency.
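
A minimal check of the complex (Hermitian) formulas above, using NumPy's built-in factorization for comparison; the test matrix is an illustrative assumption.

 import numpy as np

 A = np.array([[2.0 + 0.0j, 1.0 - 1.0j],
               [1.0 + 1.0j, 3.0 + 0.0j]])   # Hermitian positive definite
 L = np.linalg.cholesky(A)                  # complex lower triangular factor
 assert np.allclose(L @ L.conj().T, A)      # A = L L^*, diag(L) is real and positive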

6.2 A block version

An implementation of a block version of the Cholesky decomposition for Hermitian matrices is similar to the case of the block versions for real matrices.

7 Usage of Cholesky decomposition in iterative methods

The triangular factor [math]L[/math] and a solution to a corresponding linear system may not be accurate enough because of machine arithmetic. In order to refine the solution, a number of iterative methods (for example, the conjugate gradient method) can be employed using the [math]LL^T[/math] decomposition as a preconditioner.

The memory saving is the main reason to use an incomplete or inaccurate decomposition as a preconditioner.

7.1 The fill-in reduction in the Cholesky decomposition

When the Cholesky decomposition is used for sparse matrices, a large number of nonzero elements may appear. As a result, the complete decomposition may not be stored in the main memory. In this case it is reasonable to compute an incomplete or approximate decomposition for employing it as a preconditioner.

7.2 IC([math]k[/math]): the incomplete Cholesky decomposition (or factorization) by position

An incomplete Cholesky decomposition can be obtained by applying a chosen strategy of dropping fill-in elements while computing the incomplete factors. Most often, the incomplete decomposition is stored in the same positions where the nonzero elements of the original matrix [math]A[/math] are stored. Such a decomposition is denoted by IC(0) or IC0.
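
A hedged sketch of IC(0) in Python/NumPy is given below; dense loops are used only to keep the example short, and the 5-point Laplacian test matrix is an illustrative choice. The resulting factor reproduces [math]A[/math] exactly on the sparsity pattern of [math]A[/math] and leaves a nonzero error outside of it.

 import numpy as np

 def ic0(A):
     # Incomplete Cholesky by position: fill-ins outside the pattern of A are dropped.
     n = A.shape[0]
     pattern = (A != 0)
     L = np.zeros_like(A, dtype=float)
     for i in range(n):
         for j in range(i):
             if not pattern[i, j]:
                 continue                   # dropped fill-in position
             s = np.dot(L[i, :j], L[j, :j])
             L[i, j] = (A[i, j] - s) / L[j, j]
         L[i, i] = np.sqrt(A[i, i] - np.dot(L[i, :i], L[i, :i]))
     return L

 # 5-point Laplacian on a 3 x 3 grid (a classical test matrix for IC(0))
 m = 3
 N = m * m
 A = np.zeros((N, N))
 for k in range(N):
     A[k, k] = 4.0
     if k % m != m - 1:
         A[k, k + 1] = A[k + 1, k] = -1.0
     if k + m < N:
         A[k, k + m] = A[k + m, k] = -1.0

 L = ic0(A)
 E = L @ L.T - A
 print(np.max(np.abs(E[A != 0])), np.max(np.abs(E)))   # ~0 on the pattern, > 0 outside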

If the quality of the IC0 decomposition is insufficient, then it is reasonable to choose a wider structure of the triangular factor [math]L[/math]. For example, we can allow the formation of one additional level of new nonzero elements compared to the original structure of the matrix [math]A[/math]. Formally, in this example the generation of fill-in elements is allowed within the sparsity pattern of the matrix [math]A^2[/math]. Such a decomposition is denoted by IC(1).

It is also reasonable to consider a more general case when the generation of fill-in elements is allowed within the sparsity pattern of the matrix [math]A^{k+1}[/math], where [math]k \geq 0[/math]. Such a decomposition is denoted by IC([math]k[/math]).

Usually, when the value of [math]k[/math] increases, the accuracy of the IC([math]k[/math]) incomplete decomposition also increases; however, this is not always the case even for symmetric positive definite matrices, although their complete decompositions exist and are unique. Since the IC([math]k[/math]) decomposition is not complete, zero or negative elements may appear on the main diagonal. In these cases the diagonal shift [math]A+\varepsilon I[/math] is applied to the original matrix [math]A[/math] before its decomposition (here [math]\varepsilon\gt 0[/math] is a small parameter and [math]I[/math] is the identity matrix). If, during the decomposition, an excessively small or nonpositive element appears on the main diagonal, then it is replaced by a prescribed value. This operation is called a diagonal correction of the decomposition.

7.3 IC([math]\tau[/math]): the threshold based incomplete Cholesky decomposition

In order to control the generation of fill-in elements in the triangular factor [math]L[/math], the following dropping strategy can be used in addition to the above structural constraints: a new fill-in element whose absolute value is less than a small parameter [math]\tau\gt 0[/math] is dropped. Such a decomposition is also known as an approximate decomposition and is denoted by IC([math]\tau[/math]). The parameter [math]\tau[/math] is said to be a "threshold" of the decomposition or a filtration "threshold".

It can be expected that the accuracy of the decomposition increases with decreasing [math]\tau[/math]; however, the number of nonzero elements in the triangular factor [math]L[/math] also increases. A disadvantage of such a decomposition is that, in the general case, the generation of fill-in elements in the factor [math]L[/math] cannot be predicted.

From the stability standpoint, an approximate Cholesky decomposition is preferable, although the application of a preliminary diagonal shift or a diagonal correction is also allowed. If the above approaches do not lead to a sufficient accuracy of the Cholesky decomposition, then the Ajiz–Jennings approach can be used to modify the main diagonal of a matrix: if a small element [math]\ell_{ij}[/math] should be dropped, then its absolute value is added to the diagonal elements [math]\ell_{ii}[/math] and [math]\ell_{jj}[/math]. This approach ensures the existence of an approximate decomposition for any symmetric positive definite matrix [math]A[/math] and is most efficient when the Crout version of the Cholesky decomposition is used.

7.4 IC([math]\tau_1,\tau_2[/math]): the second order incomplete Cholesky decomposition

In order to increase the accuracy of an approximate Cholesky decomposition, the "double-threshold" version of such a decomposition can be used. This version is known as the Tismenetsky-Kaporin decomposition. Its main idea consists in the following: the decomposition should be performed with a higher accuracy [math]\tau_2[/math], whereas only the elements whose absolute values are not less than [math]\tau_1[/math] should be stored in the triangular factor [math]L[/math]. Usually, the values of [math]\tau_1[/math] and [math]\tau_2[/math] are chosen as follows: [math]\tau_1=\tau[/math] and [math]\tau_2=\tau^2[/math]. Such a decomposition is called the "second-order" one, since the absolute values of the elements in the error matrix turn out to be less than [math]\tau^2[/math]. This symmetric threshold based incomplete Cholesky decomposition of second order is denoted by IC2 and should not be confused with the IC(2) structural decomposition, i.e., with the IC([math]k[/math]) decomposition, where [math]k=2[/math].

Such a decomposition is usually used together with the Ajiz-Jennings approach to modify the diagonal elements. Such a combination yields a "reliable" decomposition for any symmetric positive definite matrix [math]A[/math]. In addition, this version allows one to obtain the most accurate decompositions for the same fill-in structure of the factor [math]L[/math], although the computational cost increases.

7.5 A combination of the IC([math]k,\tau[/math]) and IC([math]\tau,m[/math]) incomplete Cholesky decompositions

In order to save memory space when computing an incomplete or approximate Cholesky decomposition, the following two versions of symmetric triangular decompositions can be used.

To control the upper bound of the number of nonzero elements in the triangular factor [math]L[/math], we can first use the IC([math]k[/math]) incomplete decomposition for a chosen value of [math]k[/math]. A further memory saving for the resulting sparsity pattern can be achieved by using the IC([math]\tau[/math]) threshold based strategy with a prescribed value of [math]\tau[/math]. Such a combination is called the IC([math]k,\tau[/math]) decomposition. For example, this combination is reasonable to use when it is necessary to retain a prescribed sparsity pattern to minimize the data exchange in parallel implementations for distributed memory.

The second version is based on the dual-dropping strategy known as the IC([math]\tau,m[/math]) decomposition, where [math]\tau[/math] is the dropping threshold and [math]m[/math] is the number of fill-in elements allowed in any row of the factor [math]L[/math]. Thus, the IC([math]\tau[/math]) decomposition is supplemented with the following additional constraint imposed on the elements of the matrix rows: no more than the [math]m[/math] largest-in-magnitude elements are allowed to be stored in a row of the factor [math]L[/math]. As a result, the number of elements in the factor [math]L[/math] does not exceed [math]nm[/math], where [math]n[/math] is the order of the original matrix [math]A[/math]. For example, such an approach is reasonable for the matrices obtained by discretizing an equation on a sufficiently regular grid stencil. The best-known nonsymmetric variant of this approach was proposed by Saad and is known as the ILUT decomposition.

8 Usage of the Cholesky decomposition in parallel iterative algorithms

For the most part, the formulas of the Cholesky decomposition are recursive; hence, the identification of parallel and independent stages of computation is neither obvious nor simple. A straightforward approach may lead to an excessive amount of transferred data, which considerably reduces the performance efficiency. The largest effect can be obtained on the basis of a preliminary reordering of the original matrix.

8.1 Block reordering techniques

In order to identify independent computational blocks, symmetric permutations of rows and columns in the original matrix can be performed to represent it in a block bordered form. In this case most of the computations are concentrated in independent blocks that can be processed in parallel without data exchange.

The reverse Cuthill–McKee (RCM) ordering algorithm is the simplest (but not the most efficient) method for reordering a sparse matrix to minimize its profile width, followed by a uniform block (processor) partitioning of the reordered matrix. Each vertex of the matrix graph is assigned a processor number; after that, the vertices adjacent only to vertices with the same processor number (such vertices are considered internal) are gathered into independent blocks. The remaining vertices are combined into the bordering block to be processed separately. Inside the independent blocks, the arithmetic operations are independent and can be performed in parallel. The internal vertices of each block can be ordered by the RCM algorithm to enhance the efficiency of the triangular decomposition.

The following more efficient methods can be used to minimize the width of bordering.

  • The method of minimal separators is based on the sequential search for the separators of minimum size providing the splitting of the remaining vertices into two independent blocks.
  • The minimum degree (MD) method. A direct use of this method for large matrices is difficult; therefore, the approximate minimum degree (AMD) method is employed.
  • The nested dissection (ND) method. This is a recursive algorithm dividing the set of vertices into two independent blocks represented in the [math]2\times2[/math] block bordered form.

Another positive effect of such an ordering consists in the following: for a certain class of matrices, the complete decomposition has an almost minimally possible number of nonzero elements. Generally, the search for an optimal permutation is an NP-complete problem.

There exist other matrix ordering algorithms intended for an optimal distribution of matrices among processors. The best known are the sequential-mode packages METIS, JOSTLE, SCOTCH, CHACO, and PARTY, as well as the parallel-mode packages PARMETIS, JOSTLE, PT-SCOTCH, and ZOLTAN. Many of them are in open access.

8.2 Decomposition within independent blocks

In independent blocks, the arithmetic operations are completely independent and can be performed in parallel without data exchange among the processors in use. The only disadvantage consists in the fact that, for the threshold based decompositions, the number of arithmetic operations may be different for different processors, which may lead to unbalanced computations.

8.3 Decomposition within separators

The last block, which contains the separators of all blocks, can be processed on a head processor if the number of processors in use is small; otherwise, the separators should be processed in parallel.

8.4 Hierarchical and nested algorithms

The above algorithms for matrix ordering and for the identification of independent blocks can also be applied to the separators. Repeating this approach at the next level yields a hierarchical or nested algorithm.

In the case of the threshold based decomposition, the above approaches can be applied to the Schur complement. In the case of the structural decompositions, there exist algorithms for multilevel ordering with the minimum number of fill-in elements and data exchanges.

8.5 The block Jacobi method

The approaches described above belong to the class of "direct" (or "explicit") decompositions, in which the dropping of fill-in elements during the factorization is based only on their absolute values. In these approaches, the structural properties are of minor importance and do not influence the dropping of elements.

An alternative approach is based on the following strategy: the structural properties are the only reason for dropping the fill-in elements, which enhances the parallelism resources. For example, it is reasonable first to distribute the matrix rows among the processors and, before the decomposition, to drop all the elements that connect a processor with the others. In this case the decomposition can be performed independently within each of the blocks. Any one of the structural or threshold based incomplete versions of the Cholesky decomposition can be used inside each block. The formation of such a preconditioner is known as the non-overlapping block Jacobi method. Such a preconditioning is the simplest one and can be performed in parallel with no data exchange; however, the convergence rate of this process is not very high and is almost independent of the quality of the decomposition inside each of the blocks.
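
A hedged sketch of the non-overlapping block Jacobi preconditioner is given below; for simplicity, the complete Cholesky decomposition is used inside each block, and the routine names, test matrix, and block partitioning are illustrative assumptions.

 import numpy as np
 from scipy.linalg import solve_triangular

 def block_jacobi_factors(A, starts):
     # Drop all couplings between blocks and factor each diagonal block separately.
     return [np.linalg.cholesky(A[s:e, s:e])
             for s, e in zip(starts[:-1], starts[1:])]

 def apply_block_jacobi(factors, starts, r):
     # Solve the block-diagonal system M z = r block by block
     # (independently, i.e. in parallel in a real implementation).
     z = np.zeros_like(r)
     for (s, e), L in zip(zip(starts[:-1], starts[1:]), factors):
         y = solve_triangular(L, r[s:e], lower=True)       # forward substitution
         z[s:e] = solve_triangular(L.T, y, lower=False)    # backward substitution
     return z

 rng = np.random.default_rng(2)
 X = rng.standard_normal((6, 6))
 A = X @ X.T + 6 * np.eye(6)          # an SPD test matrix
 starts = [0, 3, 6]                   # two blocks of three rows each
 F = block_jacobi_factors(A, starts)
 z = apply_block_jacobi(F, starts, np.ones(6))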

8.6 The additive Schwarz method

Compared to the Jacobi method, the additive Schwarz method provides decompositions of much higher quality. This method is also known as the overlapping block Jacobi method. Its essence is the following: the structure of each of the matrix blocks is expanded by adding several layers of adjacent rows, and the triangular decomposition is computed for the expanded block. As a result, an expanded subproblem is solved on each of the processors using data from other processors. When a subproblem is solved, the nonlocal components of the solution are dropped on each processor. Such a version of the method is known as the restricted additive Schwarz method.

The convergence rate of the additive Schwarz method is usually much higher than that of the Jacobi method and increases monotonically with the overlap size. Although this method requires additional operations and data exchanges, its overall computational cost on parallel computers may be considerably lower.

8.7 Incomplete inverse triangular decompositions

Another version of the additive decomposition is based on overlapping the blocks backward, in the direction of rows with smaller indices, and provides a slightly higher rate of convergence. This version is called the block incomplete inverse Cholesky (abbreviated as BIIC) decomposition and is also known as the incomplete inverse triangular decomposition. Note that it can also be used for nonsymmetric matrices.

Within each of the blocks, this method can be combined with the second-order incomplete symmetric triangular decomposition (known as the IC2 decomposition). This combination is called the block incomplete inverse Cholesky decomposition of second order and is abbreviated as BIIC2.

The idea of this method was first proposed by I.E. Kaporin without consideration of its parallelization. Its parallel implementation is known as the Kaporin-Konshin method.

9 Solving triangular linear systems

The Cholesky decomposition can be used to solve linear systems [math]Ax = b[/math], where [math]A[/math] is a symmetric positive definite matrix. First the matrix [math]A[/math] is decomposed as [math]A = LL^T[/math] and, then, the two triangular linear systems [math]Ly = b[/math] and [math]L^T x = y[/math] are solved to obtain the unknown vector [math]x [/math].
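
A minimal Python sketch of this two-stage solution, using NumPy and SciPy's solve_triangular; the test data are illustrative assumptions.

 import numpy as np
 from scipy.linalg import solve_triangular

 A = np.array([[4.0, 2.0, 2.0],
               [2.0, 5.0, 3.0],
               [2.0, 3.0, 6.0]])
 b = np.array([1.0, 2.0, 3.0])

 L = np.linalg.cholesky(A)                  # A = L L^T
 y = solve_triangular(L, b, lower=True)     # forward substitution:  L y = b
 x = solve_triangular(L.T, y, lower=False)  # backward substitution: L^T x = y
 assert np.allclose(A @ x, b)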

9.1 Solving linear systems with dense lower triangular matrices

A linear system [math]L y = b[/math], where [math]L[/math] is a dense lower triangular matrix, can be solved by forward substitution (i.e., starting with the upper left-hand corner of the matrix [math]L[/math], in order of increasing row number [math]i[/math]) as follows:

[math] \begin{align} y_{1} & = b_{1}/\ell_{11}, \\ y_{i} & = \left( b_{i} - \sum_{j = 1}^{i-1} \ell_{ij} y_{j} \right) / \ell_{ii}, \quad i = 2,...,n. \end{align} [/math]

A detailed description of this algorithm and its analysis can be found in Forward substitution (a real version).
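
A short Python/NumPy sketch of forward substitution mirroring the formulas above; the example system is an illustrative assumption.

 import numpy as np

 def forward_substitution(L, b):
     # Solve L y = b for a dense lower triangular matrix L, row by row.
     n = L.shape[0]
     y = np.zeros(n)
     for i in range(n):
         y[i] = (b[i] - np.dot(L[i, :i], y[:i])) / L[i, i]
     return y

 L = np.array([[2.0, 0.0],
               [1.0, 3.0]])
 b = np.array([4.0, 7.0])
 y = forward_substitution(L, b)
 assert np.allclose(L @ y, b)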

9.2 Solving linear systems with dense upper triangular matrices

A linear system [math]U x = y[/math], where [math]U[/math] is a dense upper triangular matrix (for example, [math]U=L^T[/math]), can be solved by backward substitution (i.e., starting with the lower right-hand corner of the matrix [math]U[/math], in order of decreasing row number [math]i[/math]) as follows:

[math] \begin{align} x_{n} & = y_{n}/u_{nn}, \\ x_{i} & = \left (y_{i} - \sum_{j = i+1}^{n} u_{ij} x_{j} \right ) / u_{ii}, \quad i = n - 1,...,1. \end{align} [/math]

A detailed description of this algorithm and its analysis can be found in Backward substitution (a real version).

9.3 Solving linear systems with sparse lower triangular matrices

Linear systems with sparse lower (upper) triangular matrices are solved in a similar manner as is done in the case of dense matrices. The forward and backward substitutions are performed using only nonzero elements on the basis of approaches for sparse matrices.

9.4 Solving linear systems with complex triangular matrices

Linear systems with complex lower (upper) triangular matrices are solved in a similar manner as is done in the case of real matrices. The arithmetic operations are performed according to the complex arithmetic rules as is done for the factorization of Hermitian matrices.

9.5 Solving linear systems with block bordered triangular matrices

The following peculiarity should be mentioned: the linear systems with block bordered triangular matrices can be solved in parallel, since, in each of the blocks, the arithmetic operations are independent of the other blocks.

10 Existing implementations of the Cholesky algorithm

  • In LAPACK the function DPBTRF is used (a sequential implementation in double precision).
  • In ScaLAPACK the function PDPBTRF is used (a parallel implementation in double precision).
  • In SAS the ROOT( matrix ) function is used; this function is included in the SAS IML package.
  • In the systems MATLAB, Octave, R, the decomposition is performed using the command U = chol(A).
  • In Maple and NumPy there exists a Cholesky procedure in the linalg module.
  • In Mathematica the CholeskyDecomposition[A] procedure is used.
  • In GSL the gsl_linalg_cholesky_decomp function is used.
  • In ALGLIB the Cholesky decomposition implementations LLT and LDLT are available for the programming languages C#, C++, C++ (extended precision arithmetic), FreePascal, Delphi, VB.NET, VBA, and Python.
  • In Online Matrix Calculator the Cholesky decomposition can be performed using the web-interface (see the Section Cholesky Decomposition).
  • In OpenFOAM the Diagonal Incomplete Cholesky (DIC) decomposition is used as a preconditioner.