Givens method
Givens (rotation) method for the QR decomposition of a square matrix | |
Sequential algorithm | |
Serial complexity | [math]2 n^3[/math] |
Input data | [math]n^2[/math] |
Output data | [math]n^2[/math] |
Parallel algorithm | |
Parallel form height | [math]11n-16[/math] |
Parallel form width | [math]O(n^2)[/math] |
Primary authors of this description: А.В.Фролов, Вад.В.Воеводин (Section 2.2)
Contents
- 1 Properties and structure of the algorithm
- 1.1 General description of the algorithm
- 1.2 Mathematical description of the algorithm
- 1.3 Computational kernel of the algorithm
- 1.4 Macro structure of the algorithm
- 1.5 Implementation scheme of the serial algorithm
- 1.6 Serial complexity of the algorithm
- 1.7 Information graph
- 1.8 Parallelization resource of the algorithm
- 1.9 Input and output data of the algorithm
- 1.10 Properties of the algorithm
- 2 Software implementation of the algorithm
- 2.1 Implementation peculiarities of the serial algorithm
- 2.2 Locality of data and computations
- 2.3 Possible methods and considerations for parallel implementation of the algorithm
- 2.4 Scalability of the algorithm and its implementations
- 2.5 Dynamic characteristics and efficiency of the algorithm implementation
- 2.6 Conclusions for different classes of computer architecture
- 2.7 Existing implementations of the algorithm
- 3 References
1 Properties and structure of the algorithm
1.1 General description of the algorithm
Givens method (which is also called the rotation method in the Russian mathematical literature) is used to represent a matrix in the form [math]A = QR[/math], where [math]Q[/math] is a unitary and [math]R[/math] is an upper triangular matrix[1]. The matrix [math]Q[/math] is not stored and used in its explicit form but rather as the product of rotations. Each (Givens) rotation can be specified by a pair of indices and a single parameter.
In a conventional implementation of Givens method, this fact makes it possible to avoid using additional arrays by storing the results of decomposition in the array originally occupied by [math]A[/math]. Various uses are possible for the [math]QR[/math] decomposition of [math]A[/math]. It can be used for solving a SLAE (System of Linear Algebraic Equations) [math]Ax = b[/math] or as a step in the so-called [math]QR[/math] algorithm for finding the eigenvalues of a matrix.
At each step of Givens method, two rows of the matrix under transformation are rotated. The parameter of this transformation is chosen so as to eliminate one of the entries in the current matrix. First, the entries in the first column are eliminated one after the other, then the same is done for the second column, etc., until the column [math]n-1[/math]. The resulting matrix is [math]R[/math]. The step of the method is split into two parts: the choice of the rotation parameter and the rotation itself performed over two rows of the current matrix. The entries of these rows located to the left of the pivot column are zero; thus, no modifications are needed there. The entries in the pivot column are rotated simultaneously with the choice of the rotation parameter. Hence, the second part of the step consists in rotating two-dimensional vectors formed of the entries of the rotated rows that are located to the right of the pivot column. In terms of operations, the update of a column is equivalent to multiplying two complex numbers (or to four multiplications, one addition and one subtraction for real numbers); one of these complex numbers is of modulus 1. The choice of the rotation parameter from the two entries of the pivot column is a more complicated procedure, which is explained, in particular, by the necessity of minimizing roundoff errors. The tangent [math]t[/math] of half the rotation angle is normally used to store information about the rotation matrix. The cosine [math]c[/math] and the sine [math]s[/math] of the rotation angle itself are related to [math]t[/math] via the basic trigonometry formulas
[math]c = (1 - t^2)/(1 + t^2), s = 2t/(1 + t^2)[/math]
It is the value [math]t[/math] that is usually stored in the corresponding array entry.
1.2 Mathematical description of the algorithm
In order to obtain the [math]QR[/math] decomposition of a square matrix [math]A[/math], this matrix is reduced to the upper triangular matrix [math]R[/math] (where [math]R[/math] means right) by successively multiplying [math]A[/math] on the left by the rotations [math]T_{1 2}, T_{1 3}, ..., T_{1 n}, T_{2 3}, T_{2 4}, ..., T_{2 n}, ... , T_{n-2 n}, T_{n-1 n}[/math].
Each [math]T_{i j}[/math] specifies a rotation in the two-dimensional subspace determined by the [math]i[/math]-th and [math]j[/math]-th components of the corresponding column; all the other components are not changed. The rotation is chosen so as to eliminate the entry in the position ([math]i[/math], [math]j[/math]). Zero vectors do not change under rotations and identity transformations; therefore, the subsequent rotations preserve zeros that were earlier obtained to the left and above the entry under elimination.
At the end of the process, we obtain [math]R=T_{n-1 n}T_{n-2 n}T_{n-2 n-1}...T_{1 3}T_{1 2}A[/math].
Since rotations are unitary matrices, we naturally have [math]Q=(T_{n-1 n}T_{n-2 n}T_{n-2 n-1}...T_{1 3}T_{1 2})^* =T_{1 2}^* T_{1 3}^* ...T_{1 n}^* T_{2 3}^* T_{2 4}^* ...T_{2 n}^* ...T_{n-2 n}^* T_{n-1 n}^*[/math] and [math]A=QR[/math].
In the real case, rotations are orthogonal matrices; hence, [math]Q=(T_{n-1 n}T_{n-2 n}T_{n-2 n-1}...T_{1 3}T_{1 2})^T =T_{1 2}^T T_{1 3}^T ...T_{1 n}^T T_{2 3}^T T_{2 4}^T ...T_{2 n}^T ...T_{n-2 n}^T T_{n-1 n}^T[/math].
To complete this mathematical description, it remains to specify how the rotation [math]T_{i j}[/math] is calculated [2] and list the formulas for rotating the current intermediate matrix.
Let the matrix to be transformed contain the number [math]x[/math] in its position [math](i,i)[/math] and the number [math]y[/math] in the position [math](i,j)[/math]. Then, to minimize roundoff errors, we first calculate the uniform norm of the vector [math]z = max (|x|,|y|)[/math].
If the norm is zero, then no rotation is required: [math]t=s=0, c=1[/math].
If [math]z=|x|[/math], then we calculate [math]y_1=y/x[/math] and, next, [math]c = \frac {1}{\sqrt{1+y_1^2}}[/math], [math]s=-c y_1[/math], [math]t=\frac {1-\sqrt{1+y_1^2}}{y_1}[/math]. The updated value of the entry [math](i,i)[/math] is [math]x \sqrt{1+y_1^2}[/math].
If [math]z=|y|[/math], then we calculate [math]x_1=x/y[/math] and, next, [math]t=x_1 - x_1^2 sign(x_1)[/math], [math]s=\frac{sign(x_1)}{\sqrt{1+x_1^2}}[/math], [math]c = s x_1[/math]. The updated value of the entry [math](i,i)[/math] is [math]y \sqrt{1+x_1^2} sign(x)[/math].
Let the parameters [math]c[/math] and [math]s[/math] of the rotation [math]T_{i j}[/math] have already been obtained. Then the transformation of each column located to the right of the [math]i[/math]-th column can be described in a simple way. Let the [math]k[/math]-th column have x as its component [math]i[/math] and y as its component [math]j[/math]. The updated values of these components are [math]cx - sy[/math] and [math]sx + cy[/math], respectively. This calculation is equivalent to multiplying the complex number with the real part [math]x[/math] and the imaginary part [math]y[/math] by the complex number [math](c,s)[/math].
1.3 Computational kernel of the algorithm
The computational kernel of this algorithm can be thought of as compiled of two types of operation. The first type concerns the calculation of rotation parameters, while the second deals with the rotation itself (which can equivalently be described as the multiplication of two complex numbers with one of the factors having the modulus 1).
1.4 Macro structure of the algorithm
The operations related to the calculation of rotation parameters can be represented by a triangle on a two-dimensional grid, while the rotation itself can be represented by a pyramid on a three-dimensional grid.
1.5 Implementation scheme of the serial algorithm
In a conventional implementation scheme, the algorithm is written as the successive elimination of the subdiagonal entries of a matrix beginning from its first column and ending with the penultimate column (that is, column [math]n-1[/math]). When the [math]i[/math]-th column is "eliminated", then its components [math]i+1[/math] to [math]n[/math] are successively eliminated.
The elimination of the entry [math](j, i)[/math] consists of two steps: (a) calculating the parameters for the rotation [math]T_{ij}[/math] that eliminates the entry [math](j, i)[/math]; (b) multiplying the current matrix on the left by the rotation [math]T_{ij}[/math].
1.6 Serial complexity of the algorithm
The complexity of the serial version of this algorithm is basically determined by the mass rotation operations. If possible sparsity of a matrix is ignored, these operations are responsible (in the principal term) for [math]n^3/3[/math] complex multiplications. In a straightforward complex arithmetic, this is equivalent to [math]4n^3/3[/math] real multiplications and [math]2n^3/3[/math] real additions/subtractions.
Thus, in terms of serial complexity, Givens method is qualified as a cubic complexity algorithm.
1.7 Information graph
The macrograph of the algorithm is shown in fig. 1, while the graphs of the macrovertices are depicted in the subsequent figures.
1.8 Parallelization resource of the algorithm
In order to better understand the parallelization resource of Givens' decomposition of a matrix of order [math]n[/math], consider the critical path of its graph.
It is evident from the description of subgraphs that the macrovertex F1 (calculation of the rotation parameters) is much more "weighty" than the rotation vertex F2. Namely, the critical path of a rotation vertex consists of only one multiplication (there are four multiplications, but all of them can be performed in parallel) and one addition/subtraction (there are two operations of this sort, but they also are parallel). On the other hand, in the worst case, a macrovertex for calculating rotation parameters has the critical path consisting of a single square root calculation, two divisions, two multiplications, and two additions/subtractions.
According to a rough estimate, the critical path goes through [math]2n-3[/math] macrovertices of type F1 (calculating rotation parameters) and [math]n-1[/math] rotation macrovertices. Altogether, this yields a critical path passing through [math]2n-3[/math] square root extractions, [math]4n-6[/math] divisions, [math]5n-7[/math] multiplications, and the same number of additions/subtractions. In the macrograph shown in the figure, one of the critical paths consists of the passage through the upper line of F1 vertices (there are [math]n-1[/math] vertices) accompanied by the alternate execution of F2 and F1 ([math]n-2[/math] times) and the final execution of F2.
Thus, unlike in the serial version, square root calculations and divisions take a fairly considerable portion of the overall time required for the parallel variant. The presence of isolated square root calculations and divisions in some layers of the parallel form can also create other problems when the algorithm is implemented for a specific architecture. Consider, for instance, an implementation for PLDs. Other operations (multiplications and additions/subtractions) can be pipelined, which also saves resources. On the other hand, isolated square root calculations take resources that are idle most of the time.
In terms of the parallel form height, the Givens method is qualified as a linear complexity algorithm. In terms of the parallel form width, its complexity is quadratic.
1.9 Input and output data of the algorithm
Input data: dense square matrix [math]A[/math] (with entries [math]a_{ij}[/math]).
Size of the input data: [math]n^2[/math].
Output data: upper triangular matrix [math]R[/math] (in the serial version, the nonzero entries [math]r_{ij}[/math] are stored in the positions of the original entries [math]a_{ij}[/math]), unitary (or orthogonal) matrix [math]Q[/math] stored as the product of rotations (in the serial version, the rotation parameters [math]t_{ij}[/math] are stored in the positions of the original entries [math]a_{ij}[/math]).
Size of the output data: [math]n^2[/math].
1.10 Properties of the algorithm
It is clearly seen that the ratio of the serial to parallel complexity is quadratic, which is a good incentive for parallelization.
The computational power of the algorithm, understood as the ratio of the number of operations to the total size of the input and output data, is linear.
Within the framework of the chosen version, the algorithm is completely determined.
The roundoff errors in Givens (rotations) method grow linearly, as they also do in Householder (reflections) method.
2 Software implementation of the algorithm
2.1 Implementation peculiarities of the serial algorithm
In its simplest version, the QR decomposition of a real square matrix by Givens method can be written in Fortran as follows:
DO I = 1, N-1
DO J = I+1, N
CALL PARAMS (A(I,I), A(J,I), C, S)
DO K = I+1, N
CALL ROT2D (C, S, A(I,K), A(J,K))
END DO
END DO
END DO
Suppose that the translator at hand competently implements operations with complex numbers. Then the rotation subroutine ROT2D can be written as follows:
SUBROUTINE ROT2D (C, S, X, Y)
COMPLEX Z
REAL ZZ(2)
EQUIVALENCE Z, ZZ
Z = CMPLX(C, S)*CMPLX(X,Y)
X = ZZ(1)
Y = ZZ(2)
RETURN
END
or
SUBROUTINE ROT2D (C, S, X, Y)
ZZ = C*X - S*Y
Y = S*X + C*Y
X = ZZ
RETURN
END
(from the viewpoint of the algorithm graph, these subroutines are equivalent).
The subroutine for calculating the rotation parameters can have the following form:
SUBROUTINE PARAMS (X, Y, C, S)
Z = MAX (ABS(X), ABS(Y))
IF (Z.EQ.0.) THEN
C OR (Z.LE.OMEGA) WHERE OMEGA - COMPUTER ZERO
C = 1.
S = 0.
ELSE IF (Z.EQ.ABS(X))
R = Y/X
RR = R*R
RR2 = SQRT(1+RR)
X = X*RR2
Y = (1-RR2)/R
C = 1./RR2
S = -C*R
ELSE
R = X/Y
RR = R*R
RR2 = SQRT (1+RR)
X = SIGN(Y, X)*RR
Y = R - SIGN(RR,R)
S = SIGN(1./RR2, R)
C = S*R
END IF
RETURN
END
In the above implementation, the rotation parameter t is written to a vacant location (the entry of the modified matrix with the corresponding indices is known to be zero). This makes it possible to readily reconstruct rotation matrices (if required).
2.2 Locality of data and computations
2.2.1 Locality of implementation
2.2.1.1 Structure of memory access and a qualitative estimation of locality
Figure 7 presents the memory access profile for an implementation of the real version of the QR decomposition of a square matrix by Givens method. This profile is formed of accesses to a single two-dimensional array storing matrix values. The profile consists of iterations of the same kind, which is clearly seen from the graph. The [math]i[/math]-th iteration affects the array entries with indices beginning from [math](i-1)*k[/math]; that is, after each iteration, the first [math]k[/math] entries are no longer processed (the value of [math]k[/math] is not known at the moment). Each iteration consists of two parts performed in parallel, namely, the successive sorting of all the entries beginning from the entry [math](i-1)*k[/math] and the active use of the first [math]i*k[/math] entries.
Judging from the general picture, one can say that the locality of this profile is fairly high. Indeed, accesses to the entries close in memory are also close in program, and there are well localized sections where the data are frequently used repeatedly. However, a more detailed analysis is needed for verifying these observations.
A fragment of the general profile (set out in green) is shown in fig. 8. It can be seen that, at each iteration, both parallel processes consist of small pieces resembling the conventional successive sorting. One can clearly recognize a regular structure: all the pieces have the same size and are separated by the same distance.
Let us look at the fragment in fig. 8 more closely. Here, individual accesses can be seen, and it is safe to say that each piece is the successive sorting of a small number of entries. At the upper part of each new piece, new entries are sorted, while, at the lower part, the same data are processed. Thus, on the whole, this fragment has a high spatial locality (both parts are successive sortings) but a medium temporal locality (because the temporal locality of the upper part is very low, while the one of the lower part is very high).
The general profile is a collection of such fragments. According to figures 7 and 8, these fragments use the data repeatedly at different iterations; consequently, the general profile is likely of higher temporal locality. On the whole, one can say that the memory accesses in this program have a high spatial locality and a satisfactory temporal locality.
2.2.1.2 Quantitative estimation of locality
The basic fragment of the implementation used for obtaining quantitative estimates is given here (функция Kernel). The start-up parameters are described here
The first estimate is produced on the basis of daps, which estimates the number of memory accesses (reading and writing) per second. This characteristic, used for the interaction with memory, is an analog of the flops estimate. It largely estimates the performance of this interaction rather than locality. Nevertheless, this characteristic (in particular, its comparison with the next characteristic cvg) is a good source of information.
Figure 10 presents the values of daps for implementations of popular algorithms. They are arranged in increasing order (in general, the larger daps, the higher efficiency). It can be seen that the program under discussion is sufficiently efficient in its interaction with memory, which is consistent with our analysis of locality
The second characteristic, called cvg, is intended for obtaining a locality estimate that would be more machine independent. It determines how often a program needs to fetch data to the cache memory. Accordingly, smaller values of cvg correspond to less often fetch operations and better locality.
Figure 11 presents the values of cvg for the same set of implementations. They are arranged in decreasing order (in general, the smaller cvg, the higher locality). It can be seen that, similarly to daps, cvg shows a fairly good result. This confirms the qualitative estimate discussed above.
2.3 Possible methods and considerations for parallel implementation of the algorithm
On cluster-like supercomputers, a partitioning of an algorithm (with the usage of coordinate generalized schedule of the algorithm graph) and its distribution between different nodes (MPI) are fairly possible. At the same time, the inner parallelism in blocks will partially realize multi-core node capabilities (OpenMP) and even core superscalarity. This should be the aim of an implementation judging by the structure of the graph algorithm.
2.4 Scalability of the algorithm and its implementations
2.4.1 Scalability of the algorithm
The structure of the graph makes one think that a good scalability can be attained, although, at the moment, this guess cannot be verified through concrete implementations. The favorable factors are a good (linear) nature of the critical path and a good coordinate structure of generalized schedule, which permits a good block partitioning of the algorithm graph.
2.4.2 Scalability of the algorithm implementation
Whether or not implementations of this algorithm are well-scalable cannot presently be easily verified for the natural reasons that are described below. We extend this section after an appropriate implementation will be produced.
2.5 Dynamic characteristics and efficiency of the algorithm implementation
Dynamic characteristics and performance data for the algorithm implementations cannot presently be easily obtained for the natural reasons that are described below. We extend this section after an appropriate implementation will be produced.
2.6 Conclusions for different classes of computer architecture
The module for calculating rotation parameters has a complicated logical structure. Therefore, we recommend the reader not to use PLDs as accelerators of universal processors for the reason that a significant part of their equipment will be idle. Certain difficulties are also possible on architectures with universal processors; however, the general structure of the algorithm makes it possible to expect a good performance from such architectures. For instance, on cluster-like supercomputers, a partitioning of the algorithm and its distribution between different nodes are fairly possible. At the same time, the inner parallelism in blocks will partially realize multi-core node capabilities and even core superscalarity.
2.7 Existing implementations of the algorithm
An unexpected result was obtained by a search through the available program packages[3]. Most of old (serial) packages contain a subroutine for the QR decomposition via Givens method. By contrast, new packages, intended for parallel execution, perform the QR decomposition solely on the basis of Householder (reflections) method. Meanwhile, parallel properties of the latter are inferior to those of Givens method. We attribute this fact to that Householder method can easily be written in terms of BLAS-procedures, and it does not require loop reordering (except for the case where a partitioning is used; however, in this case, the coordinate rather than skewed loop partitioning is applied). On the other hand, some loop rewriting is needed for the parallel implementation of Givens method because skewed parallelism should be applied to achieve the maximum degree in its parallelization. Thus, we cannot assess the scalability of this algorithm implementations for the reason that no such implementations exist at the moment.
3 References
- ↑ V.V. Voevodin, Yu.A. Kuznetsov. Matrices and Computations. Moscow, Nauka, 1984 (in Russian).
- ↑ V.V. Voevodin. Computational Foundations of Linear Algebra. Moscow, Nauka, 1977 (in Russian).
- ↑ A.V. Frolov, A.S. Antonov, Vl.V. Voevodin, A.M. Teplov. Comparison of different methods for solving the same problem on the basis of Algowiki methodology (in Russian) // Submitted to the conference "Parallel computational technologies (PaVT'2016)".