Implementation level

Dot product, locality

From Algowiki


Primary author of this description: Vad.V.Voevodin (Section 2).

1 Links

The main implementation fragment, which is used in all quantitative assessments, is shown here (the KernelScalarSeqpar function).
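For reference, below is a minimal C sketch of a serial-parallel (block-summation) dot product of the kind analysed here. It only illustrates the structure of such a kernel; the actual KernelScalarSeqpar code is available at the link above, and the function and variable names below are illustrative.

/* Minimal sketch of a serial-parallel (block-summation) dot product.
 * Illustrative only; not the actual KernelScalarSeqpar fragment.      */
double scalar_seqpar(const double *a, const double *b,
                     long n, long p, double *s /* p partial sums */)
{
    long m = n / p;                     /* block size (n assumed divisible by p) */
    for (long k = 0; k < p; k++) {      /* one branch per block                  */
        double t = 0.0;
        for (long i = k * m; i < (k + 1) * m; i++)
            t += a[i] * b[i];           /* sequential pass over a and b          */
        s[k] = t;                       /* first access to s[k]                  */
    }
    double res = 0.0;
    for (long k = 0; k < p; k++)        /* final summation of the partial sums   */
        res += s[k];                    /* second access to s[k]                 */
    return res;
}

In terms of the memory profile discussed below, the passes over a and b produce fragments 2 and 3, while the two accesses to each element of s produce the two passes visible in fragment 1.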

2 Locality of data and computations

2.1 Locality of implementation

2.1.1 Structure of memory access and a qualitative estimation of locality

Figure 1. Scalar product of vectors. Overall memory address profile

Fig.1 shows the memory address profile for the floating-point version of the scalar product of vectors. The profile consists of accesses to three arrays; the fragments corresponding to the individual arrays are highlighted in green in Fig.1. Since we are considering the serial implementation of the serial-parallel summation method, the profile depends only weakly on the chosen number of branches: only the number of elements involved in fragment 1 changes.

One can see that fragments 2 and 3 are identical and simply represent a sequential traversal of all the elements of the corresponding arrays. This profile is characterized by high spatial locality and very low temporal locality, since there are no repeated accesses to the same elements.

Let us look at fragment 1 in more detail, as shown in Fig.2. It is hard to make out in the overall profile of Fig.1, but zooming in shows clearly that this fragment consists of two similar sequential passes over all the elements of the array. In this case, the temporal locality is slightly better, since each element is accessed twice.

Figure 2. Fragment 1 (first array address profile)

2.1.2 Quantitative estimation of locality

The launch conditions are described here.

The first estimate is based on daps, which assesses the number of memory access operations (reads and writes) per second. Similar to flops, this tool is used to evaluate memory access performance, rather than locality. But it is still a good source of information, particularly in combination with the next estimate – cvg.
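As a rough illustration of the kind of quantity daps measures (this is not the daps tool itself), one can count the memory accesses the kernel performs, about n reads of each input vector plus 2p accesses to the partial sums, and divide by the measured run time. The sketch below assumes it is compiled together with the scalar_seqpar sketch given earlier; all names and sizes are illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

double scalar_seqpar(const double *a, const double *b,
                     long n, long p, double *s);   /* sketch shown earlier */

int main(void)
{
    long n = 1L << 23, p = 64;                     /* illustrative sizes */
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *s = malloc(p * sizeof *s);
    for (long i = 0; i < n; i++) { a[i] = 1.0; b[i] = 0.5; }

    clock_t t0 = clock();
    double res = scalar_seqpar(a, b, n, p, s);
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

    double accesses = 2.0 * n + 2.0 * p;           /* reads of a and b plus two passes over s */
    printf("res = %.1f, about %.2e memory accesses per second\n",
           res, accesses / sec);
    free(a); free(b); free(s);
    return 0;
}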

Fig.3 shows daps values for implementations of common algorithms, sorted in ascending order (the higher the daps value, the better the performance in general). It is clear that, thanks to its high spatial locality, the performance of this program is rather good, on a par with the CG test from the NPB test set.

Figure 3. Comparison of daps assessment values

The second tool, cvg, is intended to provide a more machine-independent assessment of locality. It determines how often a program has to fetch data into cache memory. Accordingly, the smaller the cvg value, the less frequently data has to be fetched into the cache, and the better the locality.
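The idea can be illustrated with a toy cache model (this is only an illustration of the concept, not the cvg tool): replay the kernel's memory references through a small set-associative cache and count how often a reference has to bring a new line in. For a sequential traversal of double-precision data with 64-byte lines this yields roughly one miss per eight accesses, which is consistent with the high spatial locality noted above; all parameters below are illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define LINE 64                       /* cache line size, bytes            */
#define SETS 256                      /* 256 sets x 2 ways x 64 B = 32 KiB */

static uintptr_t tag[SETS][2];        /* way 0 holds the most recently used line */
static long hits, misses;

static void touch(const void *addr)   /* model one memory reference */
{
    uintptr_t line = (uintptr_t)addr / LINE;
    size_t set = line % SETS;
    if (tag[set][0] == line) { hits++; return; }
    if (tag[set][1] == line) hits++; else misses++;
    tag[set][1] = tag[set][0];        /* LRU replacement / promotion */
    tag[set][0] = line;
}

int main(void)
{
    long n = 1L << 20, p = 8, m = n / p;
    double *a = malloc(n * sizeof *a), *b = malloc(n * sizeof *b);
    double *s = malloc(p * sizeof *s), res = 0.0;
    for (long i = 0; i < n; i++) { a[i] = 1.0; b[i] = 0.5; }

    for (long k = 0; k < p; k++) {                /* block accumulation      */
        double t = 0.0;
        for (long i = k * m; i < (k + 1) * m; i++) {
            touch(&a[i]); touch(&b[i]);           /* fragments 2 and 3       */
            t += a[i] * b[i];
        }
        touch(&s[k]); s[k] = t;                   /* fragment 1, first pass  */
    }
    for (long k = 0; k < p; k++) {                /* final summation         */
        touch(&s[k]); res += s[k];                /* fragment 1, second pass */
    }

    printf("res = %.1f, miss ratio = %.3f\n",
           res, (double)misses / (hits + misses));
    free(a); free(b); free(s);
    return 0;
}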

Fig.4 shows cvg values for the same set of implementations, sorted in descending order (the smaller the cvg, the higher the locality in general). One can see that, according to this assessment, the locality is similar to that of, for example, the Triad and Sum tests from the STREAM test set. This looks reasonable, because those tests are also based on sequential array traversals.

Figure 4. Comparison of cvg assessment values

3 Scalability of the algorithm and its implementations

3.1 Scalability of the algorithm

3.2 Scalability of the algorithm implementation

4 Dynamic characteristics and efficiency of the algorithm implementation

5 Run results