# Stone doubling algorithm for the LU decomposition of a tridiagonal matrix

**Sequential algorithm**

- Serial complexity: $3(n-1)\lceil \log_2 (n-1) \rceil + o((n-1)\lceil \log_2 (n-1) \rceil)$
- Input data: $3n-2$
- Output data: $2n-1$

**Parallel algorithm**

- Parallel form height: $2 \lceil \log_2 (n-1) \rceil + 3$
- Parallel form width: $n$

Main author: Alexey Frolov

## 1 Properties and structure of the algorithm

### 1.1 General description of the algorithm

The Stone doubling algorithm for the LU decomposition of tridiagonal matrices is part of the Stone doubling algorithm for solving SLAEs [1][2] of the form $Ax = b$, where

$A = \begin{bmatrix} a_{11} & a_{12} & 0 & \cdots & \cdots & 0 \\ a_{21} & a_{22} & a_{23}& \cdots & \cdots & 0 \\ 0 & a_{32} & a_{33} & \cdots & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & \cdots & a_{n-1 n-2} & a_{n-1 n-1} & a_{n-1 n} \\ 0 & \cdots & \cdots & 0 & a_{n n-1} & a_{n n} \\ \end{bmatrix}, x = \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \\ \end{bmatrix}, b = \begin{bmatrix} b_{1} \\ b_{2} \\ \vdots \\ b_{n} \\ \end{bmatrix}$

The Stone doubling algorithm was first proposed in the early 1970s [3] as an alternative to other parallel algorithms for solving tridiagonal SLAEs, such as the cyclic reduction method.

Here, we consider the first part of the algorithm that concerns the $LU$ decomposition. The aim is to represent the matrix

$A = \begin{bmatrix} a_{11} & a_{12} & 0 & \cdots & \cdots & 0 \\ a_{21} & a_{22} & a_{23}& \cdots & \cdots & 0 \\ 0 & a_{32} & a_{33} & \cdots & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & \cdots & a_{n-1 n-2} & a_{n-1 n-1} & a_{n-1 n} \\ 0 & \cdots & \cdots & 0 & a_{n n-1} & a_{n n} \\ \end{bmatrix}$

as the product of the matrices

$L = \begin{bmatrix} 1 & 0 & 0 & \cdots & \cdots & 0 \\ l_{21} & 1 & 0 & \cdots & \cdots & 0 \\ 0 & l_{32} & 1 & \cdots & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & \cdots & l_{n-1 n-2} & 1 & 0 \\ 0 & \cdots & \cdots & 0 & l_{n n-1} & 1 \\ \end{bmatrix}$

and

$U = \begin{bmatrix} u_{11} & u_{12} & 0 & \cdots & \cdots & 0 \\ 0 & u_{22} & u_{23}& \cdots & \cdots & 0 \\ 0 & 0 & u_{33} & \cdots & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & \cdots & 0 & u_{n-1 n-1} & u_{n-1 n} \\ 0 & \cdots & \cdots & 0 & 0 & u_{n n} \\ \end{bmatrix}$

An important point is that, in exact arithmetic, the Stone doubling algorithm finds the same decomposition as the compact scheme of Gaussian elimination.

Theoretically, the Stone method is based on the following considerations. Define the values $q_i$ (the leading principal minors of $A$) by the formulas $q_0 = 1$, $q_1 = a_{11}$, and $q_i = a_{ii} q_{i-1} - a_{ii-1}a_{i-1i} q_{i-2}$ for $i = 2, \ldots, n$. As soon as all the $q_i$ are calculated, one can easily find all the entries $u_{ii} = q_i/q_{i-1}$ and $l_{ii-1} = a_{ii-1}/u_{i-1i-1}$. The entries $u_{ii+1} = a_{ii+1}$ need not be calculated.
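As an illustration, the serial recursion for the $q_i$ and the extraction of the factors can be sketched as follows (a minimal Python sketch; the function name and the storage of the matrix by diagonals are our conventions, not part of the original method description):

```python
def lu_via_minors(d, sub, sup):
    """LU factors of a tridiagonal matrix via its leading principal minors q_i.

    d[i]   = a_{i+1,i+1} (main diagonal, length n)
    sub[i] = a_{i+2,i+1} (sub-diagonal, length n-1)
    sup[i] = a_{i+1,i+2} (super-diagonal, length n-1)
    Returns the diagonal of U and the sub-diagonal of L;
    the super-diagonal of U coincides with sup.
    """
    n = len(d)
    q = [0.0] * (n + 1)            # q[0] = q_0, ..., q[n] = q_n
    q[0], q[1] = 1.0, d[0]
    for i in range(2, n + 1):      # q_i = a_ii q_{i-1} - a_{i,i-1} a_{i-1,i} q_{i-2}
        q[i] = d[i - 1] * q[i - 1] - sub[i - 2] * sup[i - 2] * q[i - 2]
    u_diag = [q[i] / q[i - 1] for i in range(1, n + 1)]        # u_ii = q_i / q_{i-1}
    l_sub = [sub[i - 1] / u_diag[i - 1] for i in range(1, n)]  # l_{i,i-1}
    return u_diag, l_sub
```

For instance, for the matrix with diagonal $(2, 2, 2)$ and off-diagonal entries $-1$, this gives $u_{11}=2$, $u_{22}=3/2$, $u_{33}=4/3$ and $l_{21}=-1/2$, $l_{32}=-2/3$, and $LU$ indeed reproduces $A$.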

When calculating the values $q_i$, one uses the associativity of matrix multiplication. The recursion relation $q_i = a_{ii} q_{i-1} - a_{ii-1}a_{i-1i} q_{i-2}$ implies the matrix equality

$\begin{bmatrix} q_i \\ q_{i-1} \\ \end{bmatrix} = \begin{bmatrix} a_{ii} & -a_{ii-1}a_{i-1i} \\ 1 & 0 \\ \end{bmatrix} \begin{bmatrix} q_{i-1} \\ q_{i-2} \\ \end{bmatrix} = T_i \begin{bmatrix} q_{i-1} \\ q_{i-2} \\ \end{bmatrix} = T_i T_{i-1}...T_2 \begin{bmatrix} q_{1} \\ q_{0} \\ \end{bmatrix} = T_i T_{i-1}...T_2 \begin{bmatrix} a_{11} \\ 1 \\ \end{bmatrix}$

Matrix products $T_i T_{i-1}...T_2$ are calculated using the parallel doubling scheme. Note that the second row of each matrix $T_i$ is identical to the first row of the identity matrix. It follows that the second row of the product $T_i T_{i-1}...T_k$ is identical to the first row of the product $T_{i-1}...T_k$, which makes it possible to halve the number of arithmetic operations compared to the analogous product of general matrices of the same order.
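This saving can be seen from a single combining step (the letters here are generic placeholders, not the entries $a_{ij}$ of the document). For $2\times 2$ matrices,

$\begin{bmatrix} \alpha & \beta \\ \gamma & \delta \\ \end{bmatrix} \begin{bmatrix} a & b \\ c & d \\ \end{bmatrix} = \begin{bmatrix} \alpha a + \beta c & \alpha b + \beta d \\ \gamma a + \delta c & \gamma b + \delta d \\ \end{bmatrix},$

and, in the doubling scheme, the second row $(\gamma a + \delta c,\; \gamma b + \delta d)$ coincides with the first row of a shorter product that is computed anyway as its own prefix product. Hence only the two entries of the first row, i.e., two operations of the type $ab+cd$, have to be formed instead of four.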

Figure 1. Graph of calculating the product of the matrices $T_{i}$ for $n=9$. Each node at the first layer corresponds to a composite operation consisting of a single multiplication and a single operation of the type $a+bc$, while each node at the other layers corresponds to two operations of the type $ab+cd$. Results calculated by black nodes are used at later stages, and intermediate results are calculated by light nodes.

### 1.2 Mathematical description of the algorithm

First, for all $i$ from 2 to $n$, we set

$p_i = a_{ii}, r_i = -a_{ii-1}a_{i-1i}$,

and

$p_1 = 1, r_1 = 0$.

At step $k$, the calculations are conducted as follows:

Beginning from $i$ = 2, skip $2^{k-1}$ numbers; for each of the next $2^{k-1}$ numbers, set $j$ equal to the nearest smaller integer of the form $j = 1 + 2^{k-1} m$.

At the first step, the values $p_i, r_i$ are updated by the formulas $p^{new}_i = p_i p_{i-1} + r_i$, $r^{new}_i = p_i r_{i-1}$; at the other steps, they are updated using the formulas $p^{new}_i = p_i p_{j}+r_i p_{j-1}$, $r^{new}_i = p_i r_{j} + r_i r_{j-1}$.

Then, again, $2^{k-1}$ numbers are skipped; for the next $2^{k-1}$ numbers, calculations similar to those above are performed, and so on until the range of $i$ from 2 to $n$ is exhausted.

The above steps are repeated until all the values of $i$ from 2 to $n$ fall into skipped blocks. After that, one calculates $q_i = p_i a_{11} + r_i$ for all $i$ from 1 to $n$, then $u_{ii} = q_i/q_{i-1}$ and $l_{ii-1} = a_{ii-1}/u_{i-1i-1}$ for all $i$ from 2 to $n$.
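The scheme can be checked with a serial simulation of the doubling steps (a Python sketch in which the parallel layers are executed sequentially; the function name and the storage of the matrix by diagonals are our conventions, not a parallel implementation):

```python
import math

def lu_stone(d, sub, sup):
    """Stone's doubling scheme for the LU factors, simulated serially.

    d[i]   = a_{i+1,i+1} (main diagonal, length n)
    sub[i] = a_{i+2,i+1} (sub-diagonal, length n-1)
    sup[i] = a_{i+1,i+2} (super-diagonal, length n-1)
    Returns the diagonal of U and the sub-diagonal of L.
    """
    n = len(d)
    # (p[i], r[i]) holds the first row of the accumulated 2x2 product for T_i
    p = [0.0] * (n + 1)
    r = [0.0] * (n + 1)
    p[1], r[1] = 1.0, 0.0                # empty product: first row of the identity
    for i in range(2, n + 1):
        p[i] = d[i - 1]                  # a_{ii}
        r[i] = -sub[i - 2] * sup[i - 2]  # -a_{i,i-1} a_{i-1,i}
    steps = math.ceil(math.log2(n - 1)) if n > 2 else 0
    for k in range(1, steps + 1):
        half, full = 2 ** (k - 1), 2 ** k
        pn, rn = p[:], r[:]
        for i in range(2, n + 1):
            offset = (i - 2) % full
            if offset < half:            # i falls into a skipped block
                continue
            if k == 1:                   # first step: combine T_i with T_{i-1}
                pn[i] = p[i] * p[i - 1] + r[i]
                rn[i] = p[i] * r[i - 1]
            else:                        # nearest smaller j of the form 1 + 2^{k-1} m
                j = i - offset + half - 1
                pn[i] = p[i] * p[j] + r[i] * p[j - 1]
                rn[i] = p[i] * r[j] + r[i] * r[j - 1]
        p, r = pn, rn
    q = [1.0] + [p[i] * d[0] + r[i] for i in range(1, n + 1)]  # q_0, ..., q_n
    u_diag = [q[i] / q[i - 1] for i in range(1, n + 1)]
    l_sub = [sub[i - 1] / u_diag[i - 1] for i in range(1, n)]
    return u_diag, l_sub
```

The pair $(p_i, r_i)$ is only ever combined with pairs from already skipped blocks, which is why the layers can run in parallel.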

### 1.3 Computational kernel of the algorithm

The computational kernel of the algorithm is mainly composed of operations of the type $ab+cd$ and $ab+c$, with a small number of additional multiplications and divisions. There are no isolated long sequences of calculations.

### 1.4 Macro structure of the algorithm

At the macro level, the following macro-operations can be set out:

1. Calculation of the entries of the matrices $T_i$.
2. Calculation of the products of the matrices $T_{i}$ using the doubling scheme.
3. Calculation of the ultimate results.

### 1.5 Implementation scheme of the serial algorithm

The Stone method was designed from the outset for parallel implementation, since it involves redundant calculations compared to, say, the classical Thomas algorithm. A serial implementation of the method also makes no sense because it is numerically unstable.

### 1.6 Serial complexity of the algorithm

For a full execution of the Stone algorithm, resulting in the decomposition of the given tridiagonal matrix into the product of two bidiagonal matrices, one should perform

• $2n-2$ divisions,
• $2(n-1)\lceil \log_2 (n-1) \rceil + o((n-1)\lceil \log_2 (n-1) \rceil)$ multiplications, and
• $(n-1)\lceil \log_2 (n-1) \rceil +o((n-1)\lceil \log_2 (n-1) \rceil)$ additions.

Consequently, in terms of serial complexity, the algorithm should be qualified as a linear-logarithmic complexity algorithm.

### 1.7 Information graph

As already noted, the macro-structure of the algorithm consists of three parts.

In the first part, off-diagonal entries of the matrices $T_{i}$ are calculated.

In the second part, products of the matrices $T_{i}$ are calculated using the doubling technique. This calculation is shown in Figure 1; it uses the results of the first part.

In the concluding third part, the final results are calculated with the use of the results of the second part.

The inner graphs of the first and third parts contain no data dependencies; their operations are related only to the second part.

### 1.8 Parallelization resource of the algorithm

The critical path of the Stone algorithm for the LU decomposition of tridiagonal matrices consists of

• $1$ layer of divisions,
• $\lceil \log_2 (n-1) \rceil+1$ layers of multiplications, and
• $\lceil \log_2 (n-1) \rceil+1$ layers of additions.

Consequently, in terms of the parallel form height, the algorithm should be qualified as a logarithmic complexity algorithm. The width of a layer is $n$; therefore, in terms of the parallel form width, the algorithm should be qualified as a linear complexity algorithm.

### 1.9 Input and output data of the algorithm

Input data: tri-diagonal matrix $A$ (with entries $a_{ij}$).

Size of the input data: $3n-2$. In different implementations, information can be stored differently for economy reasons. For instance, each diagonal of the matrix can be stored by a row of an array.

Output data: lower bidiagonal matrix $L$ (with entries $l_{ij}$, where $l_{ii}=1$) and upper bidiagonal matrix $U$ (with entries $u_{ij}$).

The size of the output data is formally $3n-2$. However, since the $n-1$ entries of the superdiagonal of $U$ are identical to entries of the input matrix, only $2n-1$ items are actually calculated.

### 1.10 Properties of the algorithm

It is clearly seen that the ratio of the serial to parallel complexity is $O(n)$.

The computational power of the algorithm, understood as the ratio of the number of operations to the total size of the input and output data, is logarithmic.

Within the chosen version, the algorithm is completely deterministic.

The algorithm is numerically stable only if the leading principal minors of the matrix do not grow. If they can grow significantly, then the stability region of the algorithm is very narrow.

## 2 Software implementation of the algorithm

### 2.1 Implementation peculiarities of the serial algorithm

Due to its great redundancy in calculations, the Stone doubling algorithm was never intended for serial implementation. After its numerical instability was detected, it became clear that it would not be implemented on serial architectures in the future either.

### 2.2 Locality of data and computations

The numerical instability of the algorithm makes any analysis of locality pointless. Nevertheless, the algorithm graph shows a number of arcs that are long both in time (meaning the difference between the indices of the layers to which the endpoints of an arc belong) and in space (excluding an allocation in a hypercube, which is physically impossible). This inherent non-locality is bound to hamper the execution of the algorithm. An actual analysis of memory accesses for a serial code makes no sense because such a code is not used now and will not be used in the future.

### 2.3 Possible methods and considerations for parallel implementation of the algorithm

Because of its numerical instability, any implementation of the algorithm makes no sense.

### 2.4 Scalability of the algorithm and its implementations

Because of the numerical instability of the algorithm, any discussion of the scalability of its implementations makes no sense.

### 2.5 Dynamic characteristics and efficiency of the algorithm implementation

The numerical instability of the algorithm makes any measurements of the efficiency and dynamic characteristics of its implementations pointless. Moreover, the redundant calculations significantly limit the achievable efficiency even for small problem sizes, which are not yet affected by the instability.

### 2.6 Conclusions for different classes of computer architecture

Even in the range of reasonable round-off errors (small problem sizes), an efficient implementation of the method is impossible: it is faster to solve an SLAE on an ordinary PC using the Thomas algorithm.
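For reference, the Thomas algorithm mentioned here amounts to a single elimination sweep followed by back substitution (a minimal Python sketch under our naming and storage conventions; it assumes the elimination needs no pivoting):

```python
def thomas_solve(d, sub, sup, b):
    """Solve a tridiagonal SLAE A x = b by the Thomas algorithm in O(n) work.

    d[i]   = a_{i+1,i+1} (main diagonal, length n)
    sub[i] = a_{i+2,i+1} (sub-diagonal, length n-1)
    sup[i] = a_{i+1,i+2} (super-diagonal, length n-1)
    """
    n = len(d)
    dd, bb = d[:], b[:]             # work on copies of the diagonal and rhs
    for i in range(1, n):           # forward elimination (the LU sweep)
        m = sub[i - 1] / dd[i - 1]  # multiplier l_{i,i-1}
        dd[i] -= m * sup[i - 1]
        bb[i] -= m * bb[i - 1]
    x = [0.0] * n                   # back substitution
    x[n - 1] = bb[n - 1] / dd[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = (bb[i] - sup[i] * x[i + 1]) / dd[i]
    return x
```

For example, for the matrix with diagonal $(2,2,2)$, off-diagonal entries $-1$, and right-hand side $(1,0,1)$, the solution is $(1,1,1)$.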

### 2.7 Existing implementations of the algorithm

Due to its numerical instability, the algorithm is not used in practice. According to the original publication [3], it was designed to replace the more popular cyclic reduction. In that respect, however, the algorithm failed.

## 3 References

1. Voevodin V.V. Computational Foundations of Linear Algebra. Moscow: Nauka, 1977.
2. Voevodin V.V., Kuznetsov Yu.A. Matrices and Computations. Moscow: Nauka, 1984.
3. Stone H.S. An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations // J. ACM, Vol. 20, No. 1 (Jan. 1973), P. 27-38.