Givens method

From Algowiki
Revision as of 11:16, 24 January 2016 by Икрамов


Primary authors of this description: A.V.Frolov, Vad.V.Voevodin (Section 2.2)


1 Properties and structure of the algorithm

1.1 General description of the algorithm

Givens' method (which is also called the rotation method in the Russian mathematical literature) is used to represent a matrix in the form $A = QR$, where $Q$ is a unitary and $R$ is an upper triangular matrix. The matrix $Q$ is not stored and used in its explicit form but rather as the product of rotations. Each (Givens) rotation can be specified by a pair of indices and a single parameter.
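For reference, a rotation acting on the components [math]i[/math] and [math]j[/math] can be written out in the real case as shown below. This is a sketch rather than the page's original formula: the placement of the minus sign is an assumption, chosen so that the pair [math](x, y)[/math] is mapped to [math](cx - sy, sx + cy)[/math], which is the update rule used in the mathematical description of this method. The entries [math]c[/math] and [math]s[/math] occupy rows and columns [math]i[/math] and [math]j[/math]; all other diagonal entries equal 1 and all remaining entries are zero.

```latex
T_{ij} =
\begin{pmatrix}
1 &        &        &        &        &        &   \\
  & \ddots &        &        &        &        &   \\
  &        & c      & \cdots & -s     &        &   \\
  &        & \vdots & \ddots & \vdots &        &   \\
  &        & s      & \cdots & c      &        &   \\
  &        &        &        &        & \ddots &   \\
  &        &        &        &        &        & 1
\end{pmatrix},
\qquad c^2 + s^2 = 1 .
```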


In a conventional implementation of Givens' method, this fact makes it possible to avoid using additional arrays by storing the results of decomposition in the array originally occupied by $A$. Various uses are possible for the $QR$ decomposition of $A$. It can be used for solving a SLAE (System of Linear Algebraic Equations) $Ax = b$ or as a step in the so-called $QR$ algorithm for finding the eigenvalues of a matrix.

At each step of Givens' method, two rows of the matrix under transformation are rotated. The parameter of this transformation is chosen so as to eliminate one of the entries of the current matrix. First, the entries of the first column are eliminated one after another, then the same is done for the second column, and so on up to column $n-1$. The resulting matrix is $R$.

Each step of the method splits into two parts: the choice of the rotation parameter and the rotation itself, which is applied to two rows of the current matrix. The entries of these rows located to the left of the pivot column are zero; thus, no modifications are needed there. The entries in the pivot column are rotated simultaneously with the choice of the rotation parameter. Hence, the second part of the step consists in rotating the two-dimensional vectors formed by the entries of the rotated rows that are located to the right of the pivot column. In terms of operations, updating one such column is equivalent to multiplying two complex numbers (or to four multiplications, one addition, and one subtraction for real numbers); one of these complex numbers has modulus 1.

The choice of the rotation parameter from the two entries of the pivot column is a more complicated procedure, mainly because roundoff errors must be kept small. The tangent [math]t[/math] of half the rotation angle is normally used to store information about the rotation matrix. The cosine [math]c[/math] and the sine [math]s[/math] of the rotation angle itself are related to [math]t[/math] via the simple tangent half-angle formulas

[math]c = (1 - t^2)/(1 + t^2), s = 2t/(1 + t^2)[/math]

It is the value [math]t[/math] that is usually stored in the corresponding array entry.
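As a small illustration of why storing [math]t[/math] alone is sufficient, the following sketch recovers [math]c[/math] and [math]s[/math] from [math]t[/math] with the formulas above. The function name and the test angle are hypothetical and serve only as an example.

```python
import math

def rotation_from_half_angle_tangent(t):
    """Recover (c, s) = (cos(theta), sin(theta)) from t = tan(theta / 2)."""
    denom = 1.0 + t * t
    return (1.0 - t * t) / denom, 2.0 * t / denom

# Illustrative angle (an arbitrary choice, not taken from the article).
theta = 0.7
c, s = rotation_from_half_angle_tangent(math.tan(theta / 2.0))
```

Since the denominator [math]1 + t^2[/math] is never smaller than 1, this reconstruction involves no risky division.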

1.2 Mathematical description of the algorithm

In order to obtain the [math]QR[/math] decomposition of a square matrix [math]A[/math], this matrix is reduced to the upper triangular matrix [math]R[/math] (where [math]R[/math] means right) by successively multiplying [math]A[/math] on the left by the rotations [math]T_{1 2}, T_{1 3}, ..., T_{1 n}, T_{2 3}, T_{2 4}, ..., T_{2 n}, ... , T_{n-2 n}, T_{n-1 n}[/math].

Each [math]T_{i j}[/math] specifies a rotation in the two-dimensional subspace determined by the [math]i[/math]-th and [math]j[/math]-th components of the corresponding column; all the other components are not changed. The rotation is chosen so as to eliminate the entry in the position ([math]j[/math], [math]i[/math]), that is, in row [math]j[/math] of column [math]i[/math]. Zero vectors do not change under rotations and identity transformations; therefore, the subsequent rotations preserve the zeros that were obtained earlier to the left of and above the entry under elimination.

At the end of the process, we obtain [math]R=T_{n-1 n}T_{n-2 n}T_{n-2 n-1}...T_{1 3}T_{1 2}A[/math].

Since rotations are unitary matrices, we naturally have [math]Q=(T_{n-1 n}T_{n-2 n}T_{n-2 n-1}...T_{1 3}T_{1 2})^* =T_{1 2}^* T_{1 3}^* ...T_{1 n}^* T_{2 3}^* T_{2 4}^* ...T_{2 n}^* ...T_{n-2 n}^* T_{n-1 n}^*[/math] and [math]A=QR[/math].

In the real case, rotations are orthogonal matrices; hence, [math]Q=(T_{n-1 n}T_{n-2 n}T_{n-2 n-1}...T_{1 3}T_{1 2})^T =T_{1 2}^T T_{1 3}^T ...T_{1 n}^T T_{2 3}^T T_{2 4}^T ...T_{2 n}^T ...T_{n-2 n}^T T_{n-1 n}^T[/math].

To complete this mathematical description, it remains to specify how the rotation [math]T_{i j}[/math] is calculated [1] and list the formulas for rotating the current intermediate matrix.

Let the matrix to be transformed contain the number [math]x[/math] in its position [math](i,i)[/math] and the number [math]y[/math] in the position [math](j,i)[/math]. Then, to minimize roundoff errors, we first calculate the uniform norm of the vector [math](x,y)[/math]: [math]z = \max (|x|,|y|)[/math].

If the norm is zero, then no rotation is required: [math]t=s=0, c=1[/math].

If [math]z=|x|[/math], then we calculate [math]y_1=y/x[/math] and, next, [math]c = \frac {1}{\sqrt{1+y_1^2}}[/math], [math]s=-c y_1[/math], [math]t=\frac {1-\sqrt{1+y_1^2}}{y_1}[/math]. The updated value of the entry [math](i,i)[/math] is [math]x \sqrt{1+y_1^2}[/math].

If [math]z=|y|[/math], then we calculate [math]x_1=x/y[/math] and, next, [math]t=x_1 - sign(x_1)\sqrt{1+x_1^2}[/math], [math]s=-\frac{sign(x_1)}{\sqrt{1+x_1^2}}[/math], [math]c = -s x_1[/math]. The updated value of the entry [math](i,i)[/math] is [math]y \sqrt{1+x_1^2} sign(x_1)[/math]. With these signs, [math]sx + cy = 0[/math] and [math]c \ge 0[/math], exactly as in the case [math]z=|x|[/math].
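The three cases can be sketched in code as follows. This is an illustration under stated assumptions rather than a reference implementation: the function name is hypothetical, the signs of [math]s[/math] and [math]c[/math] are fixed by requiring [math]sx + cy = 0[/math] and [math]c \ge 0[/math] for the update [math](cx - sy, sx + cy)[/math], and in the case [math]z=|x|[/math] the parameter [math]t[/math] is evaluated in the algebraically equivalent form [math]-y_1/(1+\sqrt{1+y_1^2})[/math] so that [math]y=0[/math] causes no division by zero.

```python
import math

def givens_params(x, y):
    """Return (c, s, t, new_x) for the rotation mapping (x, y) to (new_x, 0).

    Assumed convention: the pair is updated as (c*x - s*y, s*x + c*y),
    so s is chosen to make s*x + c*y vanish, and c >= 0.
    """
    z = max(abs(x), abs(y))          # uniform norm of (x, y)
    if z == 0.0:
        return 1.0, 0.0, 0.0, x      # nothing to eliminate
    if z == abs(x):
        y1 = y / x
        r = math.sqrt(1.0 + y1 * y1)
        c = 1.0 / r
        s = -c * y1
        t = -y1 / (1.0 + r)          # equals (1 - r) / y1 when y1 != 0
        return c, s, t, x * r
    x1 = x / y
    r = math.sqrt(1.0 + x1 * x1)
    sgn = math.copysign(1.0, x1)     # sign(x1), with sign(0) taken as +1 or -1
    s = -sgn / r
    c = -s * x1
    t = x1 - sgn * r
    return c, s, t, y * r * sgn

c, s, t, new_x = givens_params(4.0, 3.0)
```

Dividing by the larger of the two entries keeps the scaled quantity within [math][-1, 1][/math], which is what protects the square root from overflow.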

Let the parameters [math]c[/math] and [math]s[/math] of the rotation [math]T_{i j}[/math] have already been obtained. Then the transformation of each column located to the right of the [math]i[/math]-th column can be described in a simple way. Let the [math]k[/math]-th column have x as its component [math]i[/math] and y as its component [math]j[/math]. The updated values of these components are [math]cx - sy[/math] and [math]sx + cy[/math], respectively. This calculation is equivalent to multiplying the complex number with the real part [math]x[/math] and the imaginary part [math]y[/math] by the complex number [math](c,s)[/math].
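The equivalence between the real update and the complex product can be checked with a short sketch. The function names and the sample values are illustrative only.

```python
def rotate_pair(c, s, x, y):
    """Real-arithmetic rotation: four multiplications, one addition, one subtraction."""
    return c * x - s * y, s * x + c * y

def rotate_pair_complex(c, s, x, y):
    """Same rotation written as the complex product (c + i*s) * (x + i*y)."""
    z = complex(c, s) * complex(x, y)
    return z.real, z.imag

# Sample rotation with c**2 + s**2 == 1 (values chosen for illustration).
c, s = 0.8, -0.6
u, v = rotate_pair(c, s, 4.0, 3.0)
uc, vc = rotate_pair_complex(c, s, 4.0, 3.0)
```

With these sample values the pair [math](4, 3)[/math] is rotated onto the first axis, so the second component vanishes up to rounding.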


1.3 Computational kernel of the algorithm

The computational kernel of this algorithm can be thought of as composed of two types of operations. The first type concerns the calculation of rotation parameters, while the second deals with the rotation itself (which can equivalently be described as the multiplication of two complex numbers with one of the factors having the modulus 1).


1.4 Macro structure of the algorithm

The operations related to the calculation of rotation parameters can be represented by a triangle on a two-dimensional grid, while the rotation itself can be represented by a pyramid on a three-dimensional grid.


1.5 Implementation scheme of the serial algorithm

In a conventional implementation scheme, the algorithm is written as the successive elimination of the subdiagonal entries of a matrix beginning from its first column and ending with the penultimate column (that is, column n-1). When the i-th column is "eliminated", then its components i+1 to n are successively eliminated.

The elimination of the entry (j, i) consists of two steps: (a) calculating the parameters for the rotation [math]T_{ij}[/math] that eliminates the entry (j, i); (b) multiplying the current matrix on the left by the rotation [math]T_{ij}[/math].
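The elimination order described above can be sketched as follows. Several simplifications are assumptions made for brevity: indices are 0-based, the rotation parameters are computed with `math.hypot` instead of the scaled formulas, and [math]Q[/math] is accumulated explicitly (which the method itself does not do) only so that the result can be checked.

```python
import math

def givens_qr(a):
    """Return (q, r) with a = q * r, q orthogonal, r upper triangular."""
    n = len(a)
    r = [row[:] for row in a]
    # qt accumulates the product of rotations, i.e. the transpose of Q.
    qt = [[1.0 if p == k else 0.0 for k in range(n)] for p in range(n)]
    for i in range(n - 1):                 # pivot column
        for j in range(i + 1, n):          # entry (j, i) to eliminate
            # (a) rotation parameters from the two pivot-column entries
            x, y = r[i][i], r[j][i]
            h = math.hypot(x, y)
            c, s = (1.0, 0.0) if h == 0.0 else (x / h, -y / h)
            # (b) rotate rows i and j; columns left of i hold zeros in r,
            # so rotating the full rows there is harmless.
            for m in (r, qt):
                for k in range(n):
                    u, v = m[i][k], m[j][k]
                    m[i][k] = c * u - s * v
                    m[j][k] = s * u + c * v
            r[j][i] = 0.0                  # remove rounding residue
    q = [[qt[col][row] for col in range(n)] for row in range(n)]
    return q, r

a = [[2.0, 1.0, 1.0], [1.0, 3.0, 2.0], [1.0, 0.0, 0.0]]
q, r = givens_qr(a)
```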


1.6 Serial complexity of the algorithm

The complexity of the serial version of this algorithm is basically determined by the mass rotation operations. If possible sparsity of a matrix is ignored, these operations are responsible (in the principal term) for [math]n^3/3[/math] complex multiplications. In a straightforward complex arithmetic, this is equivalent to [math]4n^3/3[/math] real multiplications and [math]2n^3/3[/math] real additions/subtractions.
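The principal term can be verified by counting the rotations of two-dimensional vectors in the triple loop directly. The sketch below uses a hypothetical helper name and compares the count with the closed form [math](n-1)n(2n-1)/6[/math], which behaves like [math]n^3/3[/math] for large [math]n[/math].

```python
def count_rotations(n):
    """Count the 2D-vector rotations performed for a dense matrix of order n."""
    count = 0
    for i in range(1, n):                  # pivot columns 1 .. n-1
        for j in range(i + 1, n + 1):      # rows below the diagonal
            for k in range(i + 1, n + 1):  # columns to the right of the pivot
                count += 1
    return count

def closed_form(n):
    """Sum of (n - i)**2 for i = 1 .. n-1."""
    return (n - 1) * n * (2 * n - 1) // 6

rotations_for_4 = count_rotations(4)
```

Each counted rotation costs four real multiplications and two real additions/subtractions, which gives the real operation counts quoted above.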

Thus, in terms of serial complexity, Givens' method is qualified as a cubic complexity algorithm.


1.7 Information graph

The macrograph of the algorithm is shown in fig. 1, while the graphs of the macrovertices are depicted in the subsequent figures.

Figure 1. Graph of the algorithm (the input and output data are not shown). n=4. F1 denotes an operation of calculating rotation parameters, and F2 denotes a rotation of a 2-dimensional vector (which is equivalent to multiplying two complex numbers).

Choice of a method for calculating rotation parameters in vertices of type F1.


Figure 2. Calculation of the rotation parameters for various x and y in the vertex V2.


Figure 3. Calculation of the rotation parameters in the vertex V1 (the case of identical x and y).


Figure 4. Calculation of the rotation parameters in the vertex V0 (the case of zero x and y).


Figure 5. The inner graph of a vertex of type F2 with its input and output parameters: (u,v) = (c,s)(x,y)


1.8 Parallelization resource of the algorithm

In order to better understand the parallelization resource of Givens' decomposition of a matrix of order n, consider the critical path of its graph.

It is evident from the description of subgraphs that the macrovertex F1 (calculation of the rotation parameters) is much more "weighty" than the rotation vertex F2. Namely, the critical path of a rotation vertex consists of only one multiplication (there are four multiplications, but all of them can be performed in parallel) and one addition/subtraction (there are two operations of this sort, but they also are parallel). On the other hand, in the worst case, a macrovertex for calculating rotation parameters has the critical path consisting of a single square root calculation, two divisions, two multiplications, and two additions/subtractions.

According to a rough estimate, the critical path goes through 2n-3 macrovertices of type F1 (calculating rotation parameters) and n-1 rotation macrovertices. Altogether, this yields a critical path passing through 2n-3 square root extractions, 4n-6 divisions, 5n-7 multiplications, and the same number of additions/subtractions. In the macrograph shown in the figure, one of the critical paths consists of the passage through the upper line of F1 vertices (there are n-1 vertices) accompanied by the alternate execution of F2 and F1 (n-2 times) and the final execution of F2.
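The arithmetic behind these totals can be reproduced with a short sketch. The function name is hypothetical; the per-vertex costs are the worst-case figures quoted above (one square root, two divisions, two multiplications, and two additions/subtractions for F1; one multiplication and one addition/subtraction for F2).

```python
def critical_path_ops(n):
    """Operation counts along the critical path for a matrix of order n."""
    f1 = 2 * n - 3      # parameter-calculation macrovertices on the path
    f2 = n - 1          # rotation macrovertices on the path
    return {
        "sqrt": f1,
        "div": 2 * f1,
        "mul": 2 * f1 + f2,
        "add_sub": 2 * f1 + f2,
    }

ops = critical_path_ops(4)
```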

Thus, unlike in the serial version, square root calculations and divisions take a fairly considerable portion of the overall time required for the parallel variant. The presence of isolated square root calculations and divisions in some tiers of the parallel form can also create other problems when the algorithm is implemented for a specific architecture. Consider, for instance, an implementation for programmable logic devices (PLDs). Multiplications and additions/subtractions can be pipelined, which also saves resources. Isolated square root calculations, on the other hand, occupy resources that are idle most of the time.

In terms of the parallel form height, the Givens method is qualified as a linear complexity algorithm. In terms of the parallel form width, its complexity is quadratic.


1.9 Input and output data of the algorithm

Input data: dense square matrix A (with entries a_{ij}).

Size of the input data: n^2.

Output data: upper triangular matrix R (in the serial version, the nonzero entries r_{ij} are stored in the positions of the original entries a_{ij}), unitary (or orthogonal) matrix Q stored as the product of rotations (in the serial version, the rotation parameters t_{ij} are stored in the positions of the original entries a_{ij}).

Size of the output data: n^2.


1.10 Properties of the algorithm

It is clearly seen that the ratio of the serial to parallel complexity is quadratic, which is a good incentive for parallelization.

The computational power of the algorithm, understood as the ratio of the number of operations to the total size of the input and output data, is linear.

Within the framework of the chosen version, the algorithm is completely deterministic.

The roundoff errors in Givens' (rotation) method grow linearly, as they also do in the Householder (reflection) method.


2 Software implementation of the algorithm

2.1 Implementation peculiarities of the serial algorithm

In its simplest version, the QR decomposition of a real square matrix by Givens' method can be written in Fortran as follows:

      DO I = 1, N-1
         DO J = I+1, N
            CALL PARAMS (A(I,I), A(J,I), C, S)
            DO K = I+1, N
               CALL ROT2D (C, S, A(I,K), A(J,K))
            END DO
         END DO
      END DO

Suppose that the compiler at hand implements operations with complex numbers efficiently. Then the rotation subroutine ROT2D can be written as follows:

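      SUBROUTINE ROT2D (C, S, X, Y)
C     ROTATE (X,Y) AS THE COMPLEX PRODUCT (C + I*S)*(X + I*Y)
      COMPLEX Z
      REAL ZZ(2)
      EQUIVALENCE (Z, ZZ)
      Z = CMPLX(C, S)*CMPLX(X, Y)
      X = ZZ(1)
      Y = ZZ(2)
      RETURN
      END

or

      SUBROUTINE ROT2D (C, S, X, Y)
C     THE SAME ROTATION WRITTEN IN REAL ARITHMETIC
      ZZ = C*X - S*Y
      Y = S*X + C*Y
      X = ZZ
      RETURN
      END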

(from the viewpoint of the algorithm graph, these subroutines are equivalent).

The subroutine for calculating the rotation parameters can have the following form:

      SUBROUTINE PARAMS (X, Y, C, S)
C     ON EXIT: X = UPDATED DIAGONAL ENTRY, Y = ROTATION PARAMETER T
      Z = MAX (ABS(X), ABS(Y))
      IF (Z.EQ.0.) THEN
C     OR (Z.LE.OMEGA) WHERE OMEGA - COMPUTER ZERO
         C = 1.
         S = 0.
      ELSE IF (Z.EQ.ABS(X)) THEN
         R = Y/X
         RR = R*R
         RR2 = SQRT(1+RR)
         X = X*RR2
C        EQUALS (1-RR2)/R BUT STAYS DEFINED FOR R = 0
         Y = -R/(1+RR2)
         C = 1./RR2
         S = -C*R
      ELSE
         R = X/Y
         RR = R*R
         RR2 = SQRT(1+RR)
C        SIGNS CHOSEN SO THAT S*X + C*Y = 0 IN ROT2D
         X = Y*SIGN(RR2, R)
         Y = R - SIGN(RR2, R)
         S = -SIGN(1./RR2, R)
         C = -S*R
      END IF
      RETURN
      END

In the above implementation, the rotation parameter t is written to a vacant location (the entry of the modified matrix with the corresponding indices is known to be zero). This makes it possible to readily reconstruct rotation matrices (if required).
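The in-place storage scheme and the reconstruction of the rotations can be sketched as follows. The function names are hypothetical, indices are 0-based, and the rotation parameters are obtained in a compact way with `math.hypot` and `math.copysign`, which is an assumption made for brevity and not the scaled procedure of PARAMS. The sign convention ([math]c \ge 0[/math], [math]sx + cy = 0[/math]) is the same.

```python
import math

def rotation(x, y):
    """Return (c, s, t, new_x) with c >= 0, s*x + c*y == 0, t = tan(theta/2)."""
    h = math.copysign(math.hypot(x, y), x)   # new diagonal entry keeps the sign of x
    if h == 0.0:
        return 1.0, 0.0, 0.0, 0.0
    c, s = x / h, -y / h
    return c, s, s / (1.0 + c), h            # tan(theta/2) = sin / (1 + cos)

def factor_in_place(a):
    """Overwrite a: R on and above the diagonal, parameters t below it."""
    n = len(a)
    for i in range(n - 1):
        for j in range(i + 1, n):
            c, s, t, new_x = rotation(a[i][i], a[j][i])
            a[i][i], a[j][i] = new_x, t      # t reuses the eliminated entry
            for k in range(i + 1, n):
                x, y = a[i][k], a[j][k]
                a[i][k], a[j][k] = c * x - s * y, s * x + c * y

def restore(packed):
    """Rebuild the original matrix Q*R from the packed array."""
    n = len(packed)
    m = [[packed[i][k] if k >= i else 0.0 for k in range(n)] for i in range(n)]
    # Undo the rotations in reverse order, each one by its transpose.
    for i in reversed(range(n - 1)):
        for j in reversed(range(i + 1, n)):
            t = packed[j][i]
            d = 1.0 + t * t
            c, s = (1.0 - t * t) / d, 2.0 * t / d
            for k in range(n):
                x, y = m[i][k], m[j][k]
                m[i][k], m[j][k] = c * x + s * y, -s * x + c * y
    return m

original = [[2.0, 1.0, 1.0], [1.0, 3.0, 2.0], [1.0, 0.0, 0.0]]
packed = [row[:] for row in original]
factor_in_place(packed)
restored = restore(packed)
```

No storage beyond the original array is used by the factorization itself; the copy made in `restore` exists only to check the result.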


2.2 Locality of data and computations

2.2.1 Locality of implementation

2.2.1.1 Structure of memory access and a qualitative estimation of locality

Figure 6. Givens' (rotation) method for the QR decomposition of a square matrix (real version). The general memory access profile

Figure 6 presents the memory access profile for an implementation of the real version of the QR decomposition of a square matrix by Givens' method. This profile is formed of accesses to a single two-dimensional array storing matrix values. The profile consists of iterations of the same kind, which is clearly seen from the graph. The i-th iteration affects the array entries with indices beginning from (i-1)*k; that is, after each iteration, the first k entries are no longer processed (the value of k is not known at the moment). Each iteration consists of two parts performed in parallel, namely, a sequential scan of all the entries beginning from the entry (i-1)*k and the active use of the first i*k entries.

Judging from the general picture, one can say that the locality of this profile is fairly high. Indeed, accesses to the entries close in memory are also close in program, and there are well localized sections where the data are frequently used repeatedly. However, a more detailed analysis is needed for verifying these observations.

A fragment of the general profile (highlighted in green) is shown in fig. 7. It can be seen that, at each iteration, both parallel processes consist of small pieces resembling a conventional sequential scan. One can clearly recognize a regular structure: all the pieces have the same size and are separated by the same distance.

Figure 7. An isolated fragment of the general profile


Let us consider the fragment in fig. 7 in more detail (fig. 8). Individual accesses are now visible, and one can say with confidence that each piece is a sequential scan of a small number of entries. In the upper part, each new piece scans new entries, whereas in the lower part the same data are scanned every time. Consequently, such a fragment as a whole has high spatial locality (both parts consist of sequential scans) but only moderate temporal locality (very low in the upper part and very high in the lower part).

Figure 8. A fragment of the general profile, further zoom

The general profile consists of a set of such fragments. According to figs. 6 and 7, the data in these fragments are reused at different iterations; hence, the general profile most likely has higher temporal locality. On the whole, one can say that the memory accesses in this program have high spatial locality and reasonably good temporal locality.

2.2.1.2 Quantitative estimation of locality

The main implementation fragment used to obtain the quantitative estimates is given here (the Kernel function). The launch conditions are described here.

The first estimate is based on the characteristic daps, which measures the number of memory accesses (reads and writes) performed per second. This characteristic is the memory-access analogue of the flops estimate and reflects the performance of the interaction with memory more than locality itself. Nevertheless, it is a good source of information, in particular for comparison with the results for the next characteristic, cvg.

Figure 9 shows the daps values for implementations of common algorithms, sorted in ascending order (the larger the daps value, the higher the performance in general). One can see that the memory performance of this program is fairly high, which agrees with our expectations based on the locality analysis above.

Figure 9. Comparison of daps values

The second characteristic, cvg, is intended to give a more machine-independent estimate of locality. It determines how often the program needs to fetch data into the cache. Accordingly, the smaller the cvg value, the less often this is needed and the better the locality.

Figure 10 shows the cvg values for the same set of implementations, sorted in descending order (the smaller the cvg value, the higher the locality in general). One can see that cvg, like daps, shows a fairly good result, thus confirming the qualitative assessment given above.

  1. Voevodin V.V. Computational Foundations of Linear Algebra. Moscow: Nauka, 1977 (in Russian).