Dijkstra, Google
Primary author of this description: A. N. Daryin.
1 Links
The parallel implementation under discussion is written in C. This study was conducted using the Lomonosov supercomputer of the Moscow University Supercomputing Center [1].
2 Locality of data and computations
2.1 Locality of implementation
2.1.1 Structure of memory access and a qualitative estimation of locality
2.1.2 Quantitative estimation of locality
3 Scalability of the algorithm and its implementations
3.1 Scalability of the algorithm
3.2 Scalability of the algorithm implementation
Let us study the scalability of this parallel implementation of Dijkstra's algorithm in accordance with the Scalability methodology.
Variable start-up parameters of the algorithm implementation and the ranges over which they were varied:
- number of processors: [4, 8 : 128], taking values that are squares of integers;
- graph size: [16000 : 64000], with a step of 16000.
The following figure shows the graph of the performance of the chosen implementation as a function of the variable start-up parameters.
In view of the features of this parallel implementation of Dijkstra's algorithm, the overall performance is rather low. As the number of processors grows, the performance improves only slowly and even decreases as this number approaches 128. This is explained by the use of collective operations at every iteration of the algorithm and by the fact that the communication costs grow significantly with the number of processors. The computations on each processor are fairly fast; consequently, the decomposition of the graph hardly compensates for the communication costs.
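To illustrate where these per-iteration collective operations come from, below is a minimal sketch (not the code under study) of one iteration of a block-distributed Dijkstra step in C with MPI. Each process is assumed to own a contiguous block of rows of the adjacency matrix; the function name, data layout, and the INT_MAX convention for "no edge" are assumptions made for this example.

```c
#include <limits.h>
#include <mpi.h>

/* One iteration of a block-distributed Dijkstra step: every process owns
 * local_n vertices starting at global index first_vertex, together with the
 * matching rows of the adjacency matrix (local_w, of size local_n x n,
 * where INT_MAX means "no edge").  A small MPI_Allreduce is executed on
 * every one of the n iterations. */
void dijkstra_iteration(int local_n, int first_vertex, const int *local_w,
                        int n, int *local_dist, char *local_done)
{
    /* 1. Each rank finds the closest unvisited vertex among its own. */
    struct { int value; int index; } local_min = { INT_MAX, -1 }, global_min;
    for (int i = 0; i < local_n; ++i)
        if (!local_done[i] && local_dist[i] < local_min.value) {
            local_min.value = local_dist[i];
            local_min.index = first_vertex + i;
        }

    /* 2. The per-iteration collective: find the globally closest vertex. */
    MPI_Allreduce(&local_min, &global_min, 1, MPI_2INT, MPI_MINLOC,
                  MPI_COMM_WORLD);
    if (global_min.index < 0)
        return;                              /* no reachable vertices left */

    int u = global_min.index;
    if (u >= first_vertex && u < first_vertex + local_n)
        local_done[u - first_vertex] = 1;    /* the owner marks u as visited */

    /* 3. Relax the edges (u, v) for the locally owned vertices v.
     *    Note that only integer arithmetic is involved here. */
    for (int i = 0; i < local_n; ++i) {
        int w = local_w[(long)i * n + u];
        if (!local_done[i] && w != INT_MAX &&
            global_min.value + w < local_dist[i])
            local_dist[i] = global_min.value + w;
    }
}
```

Calling such a small Allreduce on each of the n iterations is exactly the pattern that makes communication latency, rather than bandwidth or arithmetic, the limiting factor as the number of processors grows.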
4 Dynamic characteristics and efficiency of the algorithm implementation
We used Intel Xeon X5570 processors with a peak performance of 94 Gflops and the Intel 13.1.0 compiler. The figures below illustrate the efficiency of this implementation on 32 processors.
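The 94 Gflops figure presumably refers to the peak of a dual-socket compute node: for the Xeon X5570 (4 cores at 2.93 GHz, 4 double-precision floating-point operations per cycle with SSE), this gives 2 × 4 × 2.93 GHz × 4 flops/cycle ≈ 93.8 Gflops per node.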
The graph of CPU load shows that the load is about 50 percent almost all the time. This indicates that the processors are loaded uniformly when 8 processes are executed per compute node and Hyper-Threading is not used.
Figure 3 shows the graph of the number of floating-point operations per second. The overall floating-point performance is seen to be very low: the peak value is approximately 250 Kflops, and the value averaged over all the nodes is about 150 Kflops. This indicates that almost all the calculations in the program are performed on integers.
The graph of L1 cache misses shows that, for several cores, the number of misses is very large. It stays at the level of 15 million per second (with peak values of up to 60 million per second), which indicates intensive calculations within some of the processes. The number of misses averaged over all the nodes is considerably lower (about 9 million per second). This shows that the calculations are distributed non-uniformly.
The graph of L3 cache misses shows that their number is very low (about 1.2 million per second), with the value averaged over all the nodes being approximately 0.5 million per second. The ratio of L1 to L3 cache misses for the processes with high performance can be as large as 60, and its average value is about 30. This indicates very good locality of calculations, both for individual processes and on average over all the processes, which is evidence of high performance.
The picture shown in the graph of memory read operations is fairly typical of such applications. The reading activity is rather low; in combination with the low number of L3 cache misses, this indicates good locality. For this problem, the value of about 1 million reads per second is the result of good locality and high computational performance, although there is some non-uniformity between the processes.
The graph of memory writes demonstrates a similar picture of non-uniform computations: at any given time, only a few processes are actively writing. This correlates with the other graphs. One should also note the rather low number of memory writes, which indicates a good organization of the computations and sufficiently efficient memory access.
The graph of the data rate over the InfiniBand network shows a fairly high rate measured in bytes per second. This indicates that the processes communicate intensively and that the exchanged data portions are probably rather small, since the computational performance is high. It should be noted that the data rate differs from process to process, which points to an imbalance in the computations.
The graph of the data rate measured in packets per second demonstrates an extremely high intensity of communication. This most likely means that the processes exchange data very intensively but in rather small portions. At each step, collective operations with small amounts of data are used, which explains these observations. The imbalance between the processes is smaller than that seen in the graphs of memory usage, computations, and data rate in bytes per second. This indicates that the individual processes exchange the same number of packets but receive different amounts of data and perform non-uniform computations.
The graph of the number of processes waiting to enter the computation stage (Loadavg) shows that this value remains constant throughout the program execution and is approximately equal to 8. This indicates that the program runs stably and that all the nodes are loaded with calculations. It also testifies to a very rational and static use of the hardware resources, as well as to a reasonable efficiency of the implementation under study.

On the whole, the system monitoring data allow us to conclude that the program ran rather efficiently and stably. Memory usage is very intensive, and the use of the communication network is extremely intensive, while the amounts of transmitted data are not large. On the algorithmic side, this means that the program is demanding on the latency of the communication network. The low efficiency is apparently caused by the fairly large number of transmissions from each process and by the intensive message exchanges.
5 Run results
6 References
- Voevodin Vl., Zhumatiy S., Sobolev S., Antonov A., Bryzgalov P., Nikitenko D., Stefanov K., Voevodin Vad. The Practice of the "Lomonosov" Supercomputer // Otkrytye Sistemy (Open Systems), 2012, No. 7, pp. 36-39.