Implementation level

Householder (reflections) method for reducing a symmetric matrix to tridiagonal form, locality



1 Links

The main fragment of the implementation, on the basis of which quantitative estimates were obtained, is given here.
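The linked fragment is not reproduced on this page. For orientation, below is a minimal NumPy sketch of an unblocked Householder tridiagonalization loop; the function name and the specific update scheme are illustrative assumptions and need not coincide with the measured implementation.

```python
import numpy as np

def householder_tridiag(A):
    # Illustrative sketch only: reduce a symmetric matrix to tridiagonal
    # form by a sequence of Householder reflections P = I - 2*v*v^T.
    T = np.array(A, dtype=float)          # work on a copy
    n = T.shape[0]
    for k in range(n - 2):
        x = T[k+1:, k].copy()
        normx = np.linalg.norm(x)
        if normx == 0.0:
            continue                      # column already in the required form
        s = normx if x[0] >= 0.0 else -normx
        v = x
        v[0] += s                         # v = x + sign(x[0]) * ||x|| * e1
        v /= np.linalg.norm(v)
        # The reflection annihilates column k below the first subdiagonal
        # (and, by symmetry, row k to the right of the first superdiagonal).
        T[k+1, k] = T[k, k+1] = -s
        T[k+2:, k] = 0.0
        T[k, k+2:] = 0.0
        # Two-sided update of the trailing submatrix: S <- P S P, computed
        # as S - 2*v*z^T - 2*z*v^T with z = S v - (v^T S v) v.
        S = T[k+1:, k+1:]
        w = S @ v
        z = w - (v @ w) * v
        S -= 2.0 * np.outer(v, z) + 2.0 * np.outer(z, v)
    return T

A = np.array([[4.0, 1.0, -2.0, 2.0],
              [1.0, 2.0, 0.0, 1.0],
              [-2.0, 0.0, 3.0, -2.0],
              [2.0, 1.0, -2.0, -1.0]])
T = householder_tridiag(A)
# Sanity check: the similarity transformation preserves the eigenvalues.
assert np.allclose(np.linalg.eigvalsh(A), np.linalg.eigvalsh(T))
```

Note that iteration k of the loop touches only the trailing part of the matrix starting at index k+1; this is exactly the shrinking iterative structure discussed below.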

2 Locality of data and computations

As the graph in the figure shows, the algorithm unfortunately involves a large amount of data transfer. Inevitably, some of the arcs in the graph remain long, which negatively affects spatial locality.

2.1 Locality of implementation

2.1.1 Structure of memory access and a qualitative estimation of locality

Figure 1. The Householder (reflection) method for reducing symmetric matrices to tridiagonal form: overall memory access profile

Fig. 1 shows the memory access profile for an implementation of the Householder (reflection) method, which reduces a symmetric matrix to tridiagonal form. The profile clearly has an iterative structure, and its iterations are very similar to one another. The main difference between iterations is the data set used: at each iteration, the first several elements are dropped from the analysis, so the larger the iteration index, the less data is involved. One can also see that the number of memory accesses decreases slightly from one iteration to the next. To understand the overall profile, it suffices to analyze a single iteration. Let us consider the very first one in more detail.
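This shrinking structure follows directly from the algorithm: assuming iteration k touches only the trailing (n-k-1)-vector and the (n-k-1) x (n-k-1) trailing submatrix, both the data set and the number of accesses decrease with k. A back-of-the-envelope sketch (with a hypothetical matrix order n, not taken from the measurements):

```python
n = 1024                                  # illustrative matrix order
for k in (0, 1, 2, n - 3):
    m = n - k - 1                         # size of the active trailing part
    print(f"iteration {k}: vector of {m} elements, "
          f"submatrix of {m * m} elements")
```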

Fig. 2 shows the memory access profile for the first iteration (highlighted in green in Fig. 1). It can be partitioned into several fragments, which we treat separately. The structure of fragments 2-5 can be understood from the overall graph of the iteration. Fragment 2 is a collection of sequential scans performed with a small step in reverse order, with the number of elements in each scan gradually decreasing. Such a structure has high spatial locality, because closely spaced data are accessed in succession, but low temporal locality, since the data are not reused. The same is true of fragment 4, whose main difference from fragment 2 is that its scans proceed in the direct order. Note that the apparent deformation of fragment 4 relative to fragment 2 is caused not by the structure of the fragment itself but by a different number of accesses being performed in parallel (see, for instance, the right-hand area of fragment 6).

Figure 2. Memory access profile, one iteration
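As a toy illustration (not the actual code behind Fig. 2), the two scan patterns can be mimicked as follows: a backward loop over a vector produces the reverse-order pattern of fragment 2, and a forward loop the direct-order pattern of fragment 4. Both touch consecutive addresses exactly once, hence the high spatial but low temporal locality.

```python
import numpy as np

v = np.arange(10.0)                    # toy data; real scan lengths shrink
s = 0.0
for j in range(len(v) - 1, -1, -1):    # reverse-order scan (cf. fragment 2)
    s += v[j] * v[j]
for j in range(len(v)):                # direct-order scan (cf. fragment 4)
    s += 2.0 * v[j]
```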

Fragments 3 and 5 consist of a small number of accesses to data located far from each other. Such fragments exhibit low spatial and temporal locality.

A more detailed examination is needed for the remaining fragments. Let us inspect fragment 1, which is shown as a whole in Fig. 3. Here we can see several stages, each of which is a sequential scan of elements with memory step 1. In some places the data are reused many times (this is especially noticeable in the lower right area of the figure, where a series of accesses to the same element is performed). Since fragment 1 involves only about 40 elements, we can say that this set of accesses has high spatial and temporal locality.


Figure 3. Memory access profile, one iteration, fragment 1

Now let us consider fragment 6 in detail (see Fig. 4). Here, repeated accesses to the data are performed at certain intervals rather than in immediate succession. However, the number of elements involved is again very small; therefore, in this case as well, both spatial and temporal locality are high.

Figure 4. Memory access profile, one iteration, fragment 6
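The access pattern of fragment 6 can be caricatured as repeated sweeps over a small buffer: each element is revisited once per sweep, i.e., at a fixed interval, yet the working set easily fits in cache. The sketch below is purely illustrative; the buffer size is a hypothetical stand-in for the small element count seen in Fig. 4.

```python
buf = [0.0] * 8                 # hypothetical small working set
for sweep in range(100):        # each element is revisited once per sweep
    for i in range(len(buf)):   # accesses recur at intervals of len(buf)
        buf[i] += 1.0
```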

On the whole, this iteration has fairly high spatial and temporal locality. Since the other iterations are similar in structure, the same conclusion extends to the entire memory access profile.

2.1.2 Quantitative estimation of locality

The estimate is based on daps, which assesses the number of memory accesses (reads and writes) performed per second. Similar to flops, daps evaluates memory access performance rather than locality as such. Nevertheless, it is a good source of information, particularly for comparison with the results provided by the cvg estimate.
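For reference, a daps-style figure is obtained by dividing the total number of memory accesses by the elapsed time, by analogy with flops. In the sketch below, both numbers are placeholders, not measured values:

```python
memory_accesses = 2.5e9          # hypothetical count of reads + writes
elapsed_seconds = 1.7            # hypothetical wall-clock time
daps = memory_accesses / elapsed_seconds
print(f"daps = {daps:.3e} accesses per second")
```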

Fig. 5 shows daps values for implementations of popular algorithms, sorted in ascending order (in general, the higher the daps value, the better the memory performance). One can see that the performance of memory interaction here is fairly high: the daps value is only slightly lower than that of the LINPACK benchmark, which itself shows fairly high memory performance, and is slightly higher than the values for implementations of the Jacobi method and Gaussian elimination.

Figure 5. Comparison of daps values

3 Scalability of the algorithm and its implementations

3.1 Scalability of the algorithm

3.2 Scalability of the algorithm implementation

4 Dynamic characteristics and efficiency of the algorithm implementation

5 Run results