
Latest revision as of 15:51, 8 July 2022


Stone doubling algorithm for the LU decomposition of tridiagonal matrices

Sequential algorithm
  • Serial complexity: [math]3(n-1)\lceil \log_2 (n-1) \rceil + o((n-1)\lceil \log_2 (n-1) \rceil)[/math]
  • Input data: [math]3n-2[/math]
  • Output data: [math]2n-1[/math]

Parallel algorithm
  • Parallel form height: [math]2 \lceil \log_2 (n-1) \rceil + 3[/math]
  • Parallel form width: [math]n[/math]


Main author: Alexey Frolov

1 Properties and structure of the algorithm

1.1 General description of the algorithm

The Stone doubling algorithm for the LU decomposition of tri-diagonal matrices is a part of the Stone doubling algorithm for solving SLAEs [1][2] of the form [math]Ax = b[/math], where

[math] A = \begin{bmatrix} a_{11} & a_{12} & 0 & \cdots & \cdots & 0 \\ a_{21} & a_{22} & a_{23}& \cdots & \cdots & 0 \\ 0 & a_{32} & a_{33} & \cdots & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & \cdots & a_{n-1 n-2} & a_{n-1 n-1} & a_{n-1 n} \\ 0 & \cdots & \cdots & 0 & a_{n n-1} & a_{n n} \\ \end{bmatrix}, x = \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \\ \end{bmatrix}, b = \begin{bmatrix} b_{1} \\ b_{2} \\ \vdots \\ b_{n} \\ \end{bmatrix} [/math]

The Stone doubling algorithm was first proposed in the early 1970s [3] as an alternative to other parallel algorithms for solving tri-diagonal SLAEs, such as the cyclic reduction method.

Here, we consider the first part of the algorithm that concerns the [math]LU[/math] decomposition. The aim is to represent the matrix

[math] A = \begin{bmatrix} a_{11} & a_{12} & 0 & \cdots & \cdots & 0 \\ a_{21} & a_{22} & a_{23}& \cdots & \cdots & 0 \\ 0 & a_{32} & a_{33} & \cdots & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & \cdots & a_{n-1 n-2} & a_{n-1 n-1} & a_{n-1 n} \\ 0 & \cdots & \cdots & 0 & a_{n n-1} & a_{n n} \\ \end{bmatrix} [/math]

as the product of the matrices

[math] L = \begin{bmatrix} 1 & 0 & 0 & \cdots & \cdots & 0 \\ l_{21} & 1 & 0 & \cdots & \cdots & 0 \\ 0 & l_{32} & 1 & \cdots & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & \cdots & l_{n-1 n-2} & 1 & 0 \\ 0 & \cdots & \cdots & 0 & l_{n n-1} & 1 \\ \end{bmatrix} [/math]

and

[math] U = \begin{bmatrix} u_{11} & u_{12} & 0 & \cdots & \cdots & 0 \\ 0 & u_{22} & u_{23}& \cdots & \cdots & 0 \\ 0 & 0 & u_{33} & \cdots & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & \cdots & 0 & u_{n-1 n-1} & u_{n-1 n} \\ 0 & \cdots & \cdots & 0 & 0 & u_{n n} \\ \end{bmatrix} [/math]

An important point is the fact that, in exact arithmetic, the Stone doubling algorithm finds the same decomposition as that in the compact scheme of Gaussian elimination.

Theoretically, the Stone method is based on the following considerations. Define the values [math]q_i[/math] by the formulas [math]q_0 = 1[/math], [math]q_1 = a_{11}[/math], and [math]q_i = a_{ii} q_{i-1} - a_{ii-1}a_{i-1i} q_{i-2}[/math]. As soon as all the [math]q_i[/math] are calculated, one can easily find all the entries [math]u_{ii} = q_i/q_{i-1}[/math] and [math]l_{ii-1} = a_{ii-1}/u_{i-1i-1}[/math]. The entries [math]u_{ii+1} = a_{ii+1}[/math] need not be calculated.
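These relations are easy to check numerically. The following sketch (the sample entries and variable names are our own illustration, not from the article) computes the [math]q_i[/math] by the recurrence and verifies entrywise that the resulting bidiagonal factors reproduce a small tridiagonal matrix:

```python
d   = [4.0, 5.0, 6.0, 5.0]   # a_11 .. a_44 of a small sample matrix (our choice)
sub = [1.0, 2.0, 1.0]        # a_21, a_32, a_43
sup = [2.0, 1.0, 2.0]        # a_12, a_23, a_34

q = [1.0, d[0]]              # q_0 = 1, q_1 = a_11
for i in range(2, 5):        # q_i = a_ii q_{i-1} - a_{i,i-1} a_{i-1,i} q_{i-2}
    q.append(d[i-1] * q[i-1] - sub[i-2] * sup[i-2] * q[i-2])

u = [q[i+1] / q[i] for i in range(4)]    # u_ii = q_i / q_{i-1}
l = [sub[i] / u[i] for i in range(3)]    # l_{i,i-1} = a_{i,i-1} / u_{i-1,i-1}

# Entrywise, A = LU for a tridiagonal matrix means
#   a_11 = u_11,  a_ii = l_{i,i-1} u_{i-1,i} + u_ii,  a_{i,i-1} = l_{i,i-1} u_{i-1,i-1}.
assert abs(u[0] - d[0]) < 1e-12
for i in range(1, 4):
    assert abs(l[i-1] * sup[i-1] + u[i] - d[i]) < 1e-12
    assert abs(l[i-1] * u[i-1] - sub[i-1]) < 1e-12
```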

When calculating the values [math]q_i[/math], one uses the associativity of matrix multiplication. The recursion relation [math]q_i = a_{ii} q_{i-1} - a_{ii-1}a_{i-1i} q_{i-2}[/math] implies the matrix equality

[math] \begin{bmatrix} q_i \\ q_{i-1} \\ \end{bmatrix} = \begin{bmatrix} a_{ii} & -a_{ii-1}a_{i-1i} \\ 1 & 0 \\ \end{bmatrix} \begin{bmatrix} q_{i-1} \\ q_{i-2} \\ \end{bmatrix} = T_i \begin{bmatrix} q_{i-1} \\ q_{i-2} \\ \end{bmatrix} = T_i T_{i-1}...T_2 \begin{bmatrix} q_{1} \\ q_{0} \\ \end{bmatrix} = T_i T_{i-1}...T_2 \begin{bmatrix} a_{11} \\ 1 \\ \end{bmatrix} [/math]

The matrix products [math]T_i T_{i-1}...T_2[/math] are calculated using the parallel doubling scheme. Note that the second row of each matrix [math]T_i[/math] is identical to the first row of the identity matrix. It follows that the second row of the product [math]T_i T_{i-1}...T_k[/math] is identical to the first row of the product [math]T_{i-1}...T_k[/math], which makes it possible to halve the number of arithmetic operations compared to multiplying general matrices of the same order.
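This structural fact can be checked directly. In the toy sketch below (the entry values are illustrative and chosen by us), every prefix product is formed explicitly, and its second row is compared with the first row of the preceding, shorter product:

```python
def mat2(A, B):
    """2 x 2 matrix product, written out explicitly."""
    return [[A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]],
            [A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]]]

# T_2 .. T_9 with made-up entries in the first row and (1, 0) in the second
Ts = [[[2.0 + 0.1*i, -1.0 - 0.2*i], [1.0, 0.0]] for i in range(8)]

prefix = Ts[0]
for T in Ts[1:]:
    nxt = mat2(T, prefix)          # T_i (T_{i-1} ... T_2)
    assert nxt[1] == prefix[0]     # second row = first row of the shorter product
    prefix = nxt
```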

Figure 1. Graph of calculating the product of the matrices [math]T_{i}[/math] for [math]n=9[/math]. Each node at the first layer corresponds to a composite operation consisting of a single multiplication and a single operation of the type [math]a+bc[/math], while each node at other layers corresponds to two operations of the type [math]ab+cd[/math]. Results calculated by black nodes are used at later stages, and intermediate results are calculated by light nodes

1.2 Mathematical description of the algorithm

First, for all [math]i[/math] from 2 to [math]n[/math], we set

[math]p_i = a_{ii}, r_i = -a_{ii-1}a_{i-1i}[/math],

and

[math]p_1 = 1, r_1 = 0[/math], so that the pair with index 1 represents the identity and the final formula below yields [math]q_1 = a_{11}[/math].

At step [math]k[/math], the calculations are conducted as follows:

Beginning from [math]i[/math] = 2, skip [math]2^{k-1}[/math] numbers; for each of the next [math]2^{k-1}[/math] numbers, set [math]j[/math] equal to the nearest integer below [math]i[/math] of the form [math]j = 1 + 2^{k-1} m[/math].

At the first step, the values [math]p_i, r_i[/math] are updated by the formulas [math]p^{new}_i = p_i p_{i-1} + r_i, r^{new}_i = p_i r_{i-1}[/math] (here the neighboring factor still has the second row [math](1, 0)[/math]); at the other steps, they are updated using the formulas [math]p^{new}_i = p_i p_{j}+r_i p_{j-1}, r^{new}_i = p_i r_{j} + r_i r_{j-1}[/math].

Then, again, [math]2^{k-1}[/math] numbers are skipped, the same calculations are performed for the next [math]2^{k-1}[/math] numbers, and so on until the range of [math]i[/math] from 2 to [math]n[/math] is exhausted.

The above steps are repeated until every value of [math]i[/math] from 2 to [math]n[/math] would be skipped. After that, one calculates [math]q_i = p_i a_{11} + r_i[/math] for all [math]i[/math] from 1 to [math]n[/math] and then [math]u_{ii} = q_i/q_{i-1}[/math] and [math]l_{ii-1} = a_{ii-1}/u_{i-1i-1}[/math] for all [math]i[/math] from 2 to [math]n[/math].
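As a sketch only, here is a serial rendering of the doubling procedure under our reading of the description (the diagonal-wise storage, the initialization [math]p_1 = 1, r_1 = 0[/math], and anchors of the form [math]j = 1 + 2^{k-1} m[/math] are our own conventions); the result is cross-checked against the three-term recurrence computed directly:

```python
def stone_lu(d, sub, sup):
    """d: a_11..a_nn; sub: a_21..a_{n,n-1}; sup: a_12..a_{n-1,n}.
    Returns (l, u): the subdiagonal of L and the diagonal of U
    (the superdiagonal of U equals sup and is not recomputed)."""
    n = len(d)
    # (p[i], r[i]) holds the first row of the growing product T_i T_{i-1} ...
    p, r = [0.0] * (n + 1), [0.0] * (n + 1)
    p[1], r[1] = 1.0, 0.0                    # empty product: first row of I
    for i in range(2, n + 1):
        p[i] = d[i-1]                        # a_ii
        r[i] = -sub[i-2] * sup[i-2]          # -a_{i,i-1} a_{i-1,i}
    k = 1
    while 2 ** (k - 1) + 1 < n:              # some position is still updated
        pn, rn = p[:], r[:]
        for i in range(2, n + 1):
            off = (i - 2) % (2 ** k)
            if off < 2 ** (k - 1):           # skipped position at this step
                continue
            if k == 1:                       # neighbour matrix has second row (1, 0)
                pn[i] = p[i] * p[i-1] + r[i]
                rn[i] = p[i] * r[i-1]
            else:                            # j: nearest 1 + 2^{k-1} m below i
                j = i - 1 - off + 2 ** (k - 1)
                pn[i] = p[i] * p[j] + r[i] * p[j-1]
                rn[i] = p[i] * r[j] + r[i] * r[j-1]
        p, r = pn, rn
        k += 1
    q = [1.0] + [p[i] * d[0] + r[i] for i in range(1, n + 1)]   # q_0 .. q_n
    u = [q[i+1] / q[i] for i in range(n)]                       # u_ii = q_i / q_{i-1}
    l = [sub[i] / u[i] for i in range(n - 1)]                   # l_{i,i-1} = a_{i,i-1} / u_{i-1,i-1}
    return l, u

d, sub, sup = [4.0, 5.0, 6.0, 5.0, 4.0, 5.0], [1.0] * 5, [2.0] * 5
l, u = stone_lu(d, sub, sup)

# Cross-check against the three-term recurrence computed directly.
q = [1.0, d[0]]
for i in range(2, 7):
    q.append(d[i-1] * q[i-1] - sub[i-2] * sup[i-2] * q[i-2])
assert all(abs(u[i] - q[i+1] / q[i]) < 1e-9 for i in range(6))
```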

1.3 Computational kernel of the algorithm

The computational kernel of the algorithm is mainly composed of operations of the type [math]ab+cd[/math] and [math]ab+c[/math] with a small addition of multiplications and divisions. There are no isolated long sequences of calculations.

1.4 Macro structure of the algorithm

At the macro-level, the following macro-operations can be distinguished:

  • calculation of the entries of the matrices [math]T_i[/math];
  • calculation of the products of the matrices [math]T_{i}[/math] using the doubling scheme;
  • calculation of the final results.

1.5 Implementation scheme of the serial algorithm

The Stone method was designed from the outset for parallel execution: compared to, say, the classical Thomas algorithm, it performs redundant calculations. A serial implementation of the method also makes no sense because the method is numerically unstable.
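For contrast, the serial compact scheme needs only [math]O(n)[/math] operations and no doubling machinery. A minimal sketch (the diagonal-wise storage is our own convention), using [math]u_{11} = a_{11}[/math], [math]l_{ii-1} = a_{ii-1}/u_{i-1i-1}[/math], and [math]u_{ii} = a_{ii} - l_{ii-1} a_{i-1i}[/math]:

```python
def compact_lu(d, sub, sup):
    """Compact-scheme LU of a tridiagonal matrix given by its diagonals."""
    u = [d[0]]                                # u_11 = a_11
    l = []
    for i in range(1, len(d)):
        l.append(sub[i-1] / u[i-1])           # l_{i,i-1} = a_{i,i-1} / u_{i-1,i-1}
        u.append(d[i] - l[-1] * sup[i-1])     # u_ii = a_ii - l_{i,i-1} a_{i-1,i}
    return l, u

l, u = compact_lu([4.0, 5.0, 6.0], [1.0, 2.0], [2.0, 1.0])
assert abs(u[1] - 4.5) < 1e-12                # 5 - (1/4) * 2
```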

1.6 Serial complexity of the algorithm

For a full execution of the Stone algorithm resulting in the decomposition of the given tri-diagonal matrix into the product of two bi-diagonal matrices, one should perform

  • [math]2n-2[/math] divisions,
  • [math]2(n-1)\lceil \log_2 (n-1) \rceil + o((n-1)\lceil \log_2 (n-1) \rceil)[/math] multiplications, and
  • [math](n-1)\lceil \log_2 (n-1) \rceil +o((n-1)\lceil \log_2 (n-1) \rceil)[/math] additions.

Consequently, in terms of serial complexity, the algorithm should be qualified as a linear-logarithmic complexity algorithm.

1.7 Information graph

As already noted, the macro-structure of the algorithm consists of three parts.

In the first part, off-diagonal entries of the matrices [math]T_{i}[/math] are calculated.

In the second part, products of the matrices [math]T_{i}[/math] are calculated using the doubling technique. This calculation is shown in Figure 1; it uses the results of the first part.

In the concluding third part, the final results are calculated with the use of the results of the second part.

The inner graphs of the first and third parts are empty; their operations are only related to the second part.

1.8 Parallelization resource of the algorithm

The critical path of the Stone algorithm for the LU decomposition of tri-diagonal matrices consists of

  • [math]1[/math] layer of divisions,
  • [math]\lceil \log_2 (n-1) \rceil+1[/math] layers of multiplications, and
  • [math]\lceil \log_2 (n-1) \rceil+1[/math] layers of additions.

Consequently, in terms of the parallel form height, the algorithm should be qualified as a logarithmic complexity algorithm. The width of a layer is [math]n[/math]; therefore, in terms of the parallel form width, the algorithm should be qualified as a linear complexity algorithm.
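The logarithmic height can also be seen from the step structure: under our reading of the skip pattern, step [math]k[/math] updates at least one position only while [math]2^{k-1} + 1 < n[/math], so the number of doubling steps is [math]\lceil \log_2 (n-1) \rceil[/math]. A small check (the helper function is our own):

```python
import math

def doubling_steps(n):
    """Number of doubling steps for order n (n >= 2), per the skip pattern above."""
    k = 0
    while 2 ** k + 1 < n:   # step k + 1 still updates at least one position
        k += 1
    return k

for n in range(3, 200):
    assert doubling_steps(n) == math.ceil(math.log2(n - 1))
```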

1.9 Input and output data of the algorithm

Input data: tri-diagonal matrix [math]A[/math] (with entries [math]a_{ij}[/math]).

Size of the input data: [math]3n-2[/math]. In different implementations, the matrix can be stored in different ways to save memory. For instance, each diagonal can be stored as a row of an array.

Output data: lower bi-diagonal matrix [math]L[/math] (with entries [math]l_{ij}[/math], where [math]l_{ii}=1[/math]) and upper bi-diagonal matrix [math]U[/math] (with entries [math]u_{ij}[/math]).

Size of the output data: formally [math]3n-2[/math]. However, since [math]n[/math] items of the input and output data coincide, only [math]2n-2[/math] items are actually calculated.
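A sketch of the diagonal-wise layout mentioned above (the array name and padding choice are our own): the three diagonals fit into a [math]3 \times n[/math] array with two unused corner cells, which gives the [math]3n-2[/math] count:

```python
n = 5
A = [[0.0] * n for _ in range(3)]    # row 0: subdiagonal, row 1: main, row 2: superdiagonal
for i in range(n):
    A[1][i] = 4.0                    # a_{i+1, i+1}
for i in range(n - 1):
    A[0][i + 1] = 1.0                # a_{i+2, i+1}; A[0][0] stays unused
    A[2][i] = 2.0                    # a_{i+1, i+2}; A[2][n-1] stays unused
useful = n + 2 * (n - 1)             # stored items that carry matrix entries
assert useful == 3 * n - 2
```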

1.10 Properties of the algorithm

It is clearly seen that the ratio of the serial to parallel complexity is [math]O(n)[/math].

The computational power of the algorithm, understood as the ratio of the number of operations to the total size of the input and output data, is logarithmic.

Within the version chosen here, the algorithm is completely deterministic.

The algorithm is numerically stable only if the leading principal minors of the matrix do not grow. If they can grow significantly, then the stability region of the algorithm is very narrow.
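A toy demonstration of why growing leading principal minors are fatal here (the sample matrix is our own choice): the [math]q_i[/math] are exactly the leading principal minors, so they leave the range of double precision long before [math]n[/math] becomes large, even though the factors [math]u_{ii} = q_i/q_{i-1}[/math] stay moderate:

```python
import math

q_prev, q_cur = 1.0, 4.0          # q_0, q_1 for a matrix with a_ii = 4
for i in range(2, 701):           # off-diagonal products a_{i,i-1} a_{i-1,i} = 1
    q_prev, q_cur = q_cur, 4.0 * q_cur - 1.0 * q_prev

# The q_i grow like (2 + sqrt(3))^i ~ 3.73^i and become non-finite near
# i ~ 540, although u_ii -> 2 + sqrt(3) stays small.
assert not math.isfinite(q_cur)
u_limit = 2.0 + math.sqrt(3.0)
assert 3.7 < u_limit < 3.8
```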

2 Software implementation of the algorithm

2.1 Implementation peculiarities of the serial algorithm

Due to its great redundancy in calculations, the Stone doubling algorithm was never intended for a serial implementation. After its numerical instability was discovered, it became clear that it would not be implemented on serial architectures in the future either.

2.2 Possible methods and considerations for parallel implementation of the algorithm

Because of its numerical instability, no implementation of the algorithm makes practical sense.

The numerical instability of the algorithm makes any analysis of locality pointless. Nevertheless, the algorithm graph shows a number of arcs that are long both in time (that is, in the difference between the indices of the layers containing the endpoints of an arc) and in space (the only exception being an allocation on a hypercube, which is physically impossible). This inherent non-locality must hamper the execution of the algorithm. An actual analysis of memory accesses in a serial code makes no sense because such a code is not used and will not be used by anybody.

The numerical instability of the algorithm makes any measurements of the efficiency and dynamic characteristics of its implementations pointless. Moreover, the redundant calculations significantly limit the actual efficiency even for small problem sizes, which are not yet affected by the instability.

Due to its numerical instability, the algorithm is not used in practice. According to the original publication [3], it was designed to replace the more popular cyclic reduction; in this, however, it has failed.

2.3 Run results

2.4 Conclusions for different classes of computer architecture

Even in the range where round-off errors are still acceptable (small problem sizes), the method has no chance of an efficient implementation: it is faster to solve the SLAE on an ordinary PC using the Thomas algorithm.

3 References

  1. Voevodin V.V. Computational Foundations of Linear Algebra. Moscow: Nauka, 1977.
  2. Voevodin V.V., Kuznetsov Yu.A. Matrices and Computations. Moscow: Nauka, 1984.
  3. Stone H.S. An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations // J. ACM, Vol. 20, No. 1 (Jan. 1973), P. 27-38.