Implementation level

Dense matrix multiplication, scalability

From Algowiki
Jump to navigation Jump to search


Primary author of this description: A.M.Teplov (Section 3).

1 Links

Algorithm implementation in C.

2 Locality of data and computations

2.1 Locality of implementation

2.1.1 Structure of memory access and a qualitative estimation of locality

2.1.2 Quantitative estimation of locality

3 Scalability of the algorithm and its implementations

3.1 Scalability of the algorithm

3.2 Scalability of of the algorithm implementation

Figure 1. Parallel implementation of matrix multiplication. Maximum performance.
Figure 2. Parallel implementation of matrix multiplication. Maximum efficiency.

The set of variable launch parameters for implementing the algorithm and limits for algorithm parameter values:

  • Number of processors [4 : 1024]
  • Matrix size [1024 : 20480]

Efficiency of algorithm implementation execution

  • Minimum efficiency 4.71%
  • Maximum efficiency 31.72%

Scalability evaluation:

  • By the number of processes: –0.0436 – as the number of processes increases, efficiency falls rather intensively throughout the launch parameter change area. The efficiency reduction in the parallel program execution area is explained by the higher volumes of data to be sent between the processes and the respective increase in the overhead cost of organizing parallel computation. There is a limited area where efficiency increases with growth in the number of processes; however, further growth decreases the efficiency again. This is explained by data decomposition, which has a moment when matrix size allows the blocks to fit in cache memory. This is also confirmed by the same phenomenon, with a shift in the number of processes, with the growth of computational complexity.
  • By the size of the task: –0.0255 – as the task size grows, efficiency generally decreases over the area considered, though not as intensely as with growth in the number of processes. The efficiency reduction is explained by the fact that the volume of data transmitted increases greatly with the growth of computational complexity. At all matrix sizes considered, there is an area with higher efficiency due to the fact that with a smaller task size, the data fits well in cache memory, leading to higher application efficiency at a smaller task size. With further size increases, efficiency drops as the data size expands beyond the cache memory boundaries.
  • By two directions: – 0.000968 – when both task complexity and the number of processes are increased, efficiency is reduced, though at a very small rate. Combined with the fact that the difference between maximum and minimum efficiency at the considered parameter area is almost 25%, this shows that the efficiency reduction is uniform over the entire range of tasks, but changes intensively over some small areas. In other areas, efficiency changes are small, staying at approximately the same level.

4 Dynamic characteristics and efficiency of the algorithm implementation

5 Run results