Algorithm level

Difference between revisions of "Dot product"

From Algowiki
Jump to navigation Jump to search
[unchecked revision][checked revision]
 
(9 intermediate revisions by the same user not shown)
Line 1: Line 1:
Основные авторы описания: [[Участник:Frolov|А.В.Фролов]], [[Участник:VadimVV|Вад.В.Воеводин]] ([[#Описание локальности данных и вычислений|раздел 2.2]]), [[Участник:Teplov|А.М.Теплов]] (раздел [[#Масштабируемость алгоритма и его реализации|2.4]])
+
{{level-a}}
  
== Описание свойств и структуры алгоритма ==
+
Primary authors of this description: [[:ru:Участник:Frolov|A.V.Frolov]].
  
=== Общее описание алгоритма ===
+
== Properties and structure of the algorithm ==
  
'''Скалярное произведение векторов''' используется в качестве одной из базовых операций в широком круге методов. При этом используется как в версии скалярного произведения собственно <math>n</math>-мерных векторов (одномерных массивов размера <math>n</math>), так и в версии скалярного произведения строк, столбцов и других линейных подмножеств массивов большей размерности. Последняя отличается от первой тем, что соответствующая подпрограмма получает, кроме стартовых адресов векторов, также и параметры смещения следующих элементов относительно предыдущих (в первой версии эти смещения равны 1). Разные формулы существуют для скалярных произведений в вещественной арифметике и для комплексных векторов. Здесь мы рассматриваем только вещественную арифметику и последовательно-параллельную реализацию.
+
=== General description of the algorithm ===
  
=== Математическое описание ===
+
'''The dot product of vectors''' is one of the basic operations in a number of methods. It is used in two versions: as the proper dot product of <math>n</math>-dimensional vectors (one-dimensional arrays of size <math>n</math>) and as the scalar product of rows, columns, and other linear subsets of multidimensional arrays. The latter differs from the former in that the corresponding subroutine receives not only the initial addresses of vectors but also parameters of the displacement of the subsequent components with respect to the preceding ones (for the former version, all the displacements are equal to one). Formulas defining the dot product are different for real and complex arithmetic. Here, we deal only with real arithmetic and the serial-parallel implementation.
  
Исходные данные: два одномерных массива n чисел.
+
=== Mathematical description of the algorithm ===
  
Вычисляемые данные: сумма попарных произведений элементов массива.
+
Input data: two one-dimensional numeric arrays of size n.
  
Формулы метода:  
+
Output data: the sum of pairwise products of the corresponding elements of both arrays.
число <math>n</math> разлагается в выражение типа
+
 
 +
Formulas of the method:
 +
the number <math>n</math> is expressed in the form
 
<math>n = (p - 1) k + q</math>,  
 
<math>n = (p - 1) k + q</math>,  
где <math>p</math> — количество процессоров, <math>k = \lceil \frac{n}{p} \rceil</math>,
+
where <math>p</math> is the number of processors, <math>k = \lceil \frac{n}{p} \rceil</math>, and
<math>q = n - k (p - 1)</math>.
+
<math>q = n - k (p - 1)</math>. Then the <math>i</math>-th processor (<math>i < p</math>) calculates in the serial mode the partial dot product of the subarrays formed of the elements with indices from <math>(i - 1) k + 1</math> to <math>ik</math>.
После этого на <math>i</math>-м процессоре (<math>i < p</math>) последовательно вычисляется «частичное» скалярное произведение подмассивов, начиная с <math>(i - 1) k + 1</math>-го номера элемента, до <math>ik</math>-го номера.
 
  
 
:<math>S_i = \sum_{j = 1}^k a_{k (i - 1) + j} b_{k (i - 1) + j}</math>
 
:<math>S_i = \sum_{j = 1}^k a_{k (i - 1) + j} b_{k (i - 1) + j}</math>
  
На <math>p</math>-м процессоре последовательно вычисляется «частичное» скалярное произведение подмассивов, начиная с <math>(p - 1) k + 1</math>-го номера элемента до <math>(p - 1) k + q</math>-го номера.
+
The <math>p</math>-th processor calculates in the serial mode the partial dot product of the subarrays formed of the elements with indices from <math>(p - 1) k + 1</math> to <math>(p - 1) k + q</math>.
  
 
:<math>S_p = \sum_{j = 1}^q a_{k (p - 1) + j} b_{k (p - 1) + j}</math>
 
:<math>S_p = \sum_{j = 1}^q a_{k (p - 1) + j} b_{k (p - 1) + j}</math>
  
По окончании этого процесса процессоры обмениваются данными и на одном из них (либо на всех одновременно, если результат нужен далее на всех процессорах) получившиеся суммы суммируются последовательно друг с другом.
+
Upon ending this process, the processors exchange their data, and then one of them (or all of them simultaneously if they need the result later on) add the partial sums in the serial mode.  
  
 
:<math>\sum_{i = 1}^p S_i</math>
 
:<math>\sum_{i = 1}^p S_i</math>
  
При этом в последовательно-параллельном варианте при вычислений сумм из формул используется последовательный порядок суммирования (обычно от меньших индексов к большим).  
+
In the serial-parallel version, all the sums are calculated in the serial mode (usually in the increasing order of indices).  
  
=== Вычислительное ядро алгоритма ===
+
=== Computational kernel of the algorithm ===
  
Вычислительное ядро скалярного произведения в последовательно-параллельном варианте можно представить как <math>p</math> вычислений «частных» скалярных произведений c последующим последовательным суммированием получившихся <math>p</math> чисел.
+
The computational kernel of the dot product in the serial-parallel version can be represented as <math>p</math> calculations of partial dot products with the subsequent serial summation of <math>p</math> partial results.
  
=== Макроструктура алгоритма ===
+
=== Macro structure of the algorithm ===
  
Как уже записано в описании ядра алгоритма, основную часть вычисления скалярного произведения составляют параллельное вычисление скалярных произведений меньшей размерности последовательным методом и последовательное вычисление суммы получившихся «частных» скалярных произведений подмассивов.
+
As already noted in the description of the kernel, the basic part of the algorithm consists of the parallel computation of partial dot products, calculated in the serial mode, and the subsequent summation of these partial dot products.  
  
=== Описание схемы реализации последовательного алгоритма ===
+
=== Implementation scheme of the serial algorithm ===
  
Формулы метода описаны выше. Последовательность исполнения суммирования может быть разная — как по возрастанию, так и по убыванию индексов. Обычно без особых причин порядок не меняют, используя естественный (возрастание индексов).
+
The formulas of the method are given above. The order of summation can be different, both in the increasing and decreasing order of indices. If no special reason is pursued, then, usually, the natural order of increasing indices is used.  
  
=== Последовательная сложность алгоритма ===
+
=== Serial complexity of the algorithm ===
  
Для вычисления скалярного произведения массивов, состоящих из <math>n</math> элементов, при любых разложениях количество операций умножения неизменно и равно <math>n</math>, а количество операций сложения равно <math>n - 1</math>. Поэтому алгоритм должен быть отнесён к алгоритмам ''линейной'' сложности по количеству последовательных операций.
+
In any implementation of the dot product for arrays consisting of <math>n</math> elements the number of scalar multiplications is invariably equal to <math>n</math>, while the number of additions is <math>n - 1</math>. Consequently, in terms of the number of serial operations, the algorithm should be qualified as a ''linear'' complexity algorithm.
  
=== Информационный граф ===
+
=== Information graph ===
  
Опишем граф алгоритма в виде рисунка. Однако следует отметить, что в большинстве случаев программисты не экономят на одном вызове операции сложения, а инициализируют начальное значение переменной нулём. В этом случае граф становится таким, как на втором рисунке (n=24).
+
We describe the graph of the algorithm graphically (see the first figure). It should be noted, however, that, in most cases, programmers prefer to assign zero initial values to variables chosen for summation, which increases the number of additions by one. In such cases, the graph is as shown in the second figure (n=24).
  
 
{| align="left"
 
{| align="left"
 
     |- valign="top"
 
     |- valign="top"
     | [[file:series-parallel_dot_product_graph.png|thumb|750px|Последовательно-параллельное вычисление скалярного произведения с экономией операций сложения]]
+
     | [[file:series-parallel_dot_product_graph.png|thumb|750px|Figure 1. Serial-parallel calculation of the dot product with saving in the number of additions]]
     | [[file:Series-parallel_dot_product_graph_straight.png|thumb|790px|Последовательно-параллельное вычисление скалярного произведения без экономии операций сложения]]
+
     | [[file:Series-parallel_dot_product_graph_straight.png|thumb|790px|Figure 2. Serial-parallel calculation of the dot product without saving in the number of additions]]
 
|}
 
|}
  
=== Описание ресурса параллелизма алгоритма ===
+
=== Parallelization resource of the algorithm ===
 +
 
 +
The parallel version of the serial-parallel method for calculating the dot product of arrays of size <math>n</math> requires that the following layers be successively executed:
 +
* 1 layer of calculating pairwise products,
 +
* <math>k - 1</math> layers of summation for partial dot products (<math>p</math> branches),
 +
* <math>p - 1</math> layers of summation (one serial branch).
 +
 
 +
Therefore, the critical path of the algorithm in the parallel version (and the corresponding height of the CPF) depends on the chosen partition of the arrays. In the optimal case (<math>p = \sqrt{n}</math>), the height of the CPF is <math> 2 \sqrt{n} - 1</math>. Thus, in terms of the height of the CPF, the serial-parallel method is qualified as a square-root complexity algorithm. In terms of the width of the CPF, it is also a square-root complexity algorithm. 
 +
 
 +
=== Input and output data of the algorithm ===
 +
 
 +
Input data: array<math>a</math> (with elements <math>a_i</math>), array <math>b</math> (with elements <math>b_i</math>).
 +
 
 +
Further restrictions: none.
 +
 
 +
Size of the input data: <nowiki/><math>2 n</math>.
 +
 
 +
Output data: the sum of pairwise products of the corresponding elements of two arrays.
 +
 
 +
Size of the output data: single scalar.
 +
 
 +
=== Properties of the algorithm ===
 +
 
 +
It is clear that, in the case of unlimited resources, the ratio of the serial to parallel complexity is expressed by the ''square root'' of n (since this is the ratio of the linear to square-root complexity). The computational power of the algorithm, understood as the ratio of the number of operations to the total size of the input and output data, is only ''1 (the size of the input and output data is almost equal to the number of operations; more exactly, the former number exceeds the latter by 2)''. For a given partition of <math>n</math>, the algorithm is completely determined. The edges of the information graph are local. In a number of algorithms using dot products of single precision and the accumulation mode, these products are calculated in double precision in order to reduce round-off errors. However, even without accumulation mode, the serial-parallel method for calculating the dot product reduces the effect of round-off errors <math>\sqrt{n}</math>-fold "on the average".
  
Для вычисления скалярного произведения массивов порядка <math>n</math> последовательно-параллельным методом в параллельном варианте требуется последовательно выполнить следующие ярусы:
+
== Software implementation of the algorithm  ==
* 1 ярус вычисления произведений,
 
* <math>k - 1</math> ярусов суммирования по частям массивов (<math>p</math> ветвей),
 
* <math>p - 1</math> ярусов суммирования (одна последовательная ветвь).
 
  
Таким образом, в параллельном варианте критический путь алгоритма  (и соответствующая ему высота ЯПФ) будет зависеть от произведённого разбиения массива на части. В оптимальном случае (<math>p = \sqrt{n}</math>) высота ЯПФ будет равна <math> 2 \sqrt{n} - 1</math>. При классификации по высоте ЯПФ, таким образом, последовательно-параллельный метод относится к алгоритмам ''со сложностью «корень квадратный»''. При классификации по ширине ЯПФ его сложность также будет ''«корень квадратный»''.
+
=== Implementation peculiarities of the serial algorithm ===
  
=== Описание входных и выходных данных ===
+
The simplest variation (without a summation inversion) in Fortran can be written as follows:
  
Входные данные: массивы <math>a</math> (элементы <math>a_i</math>), <math>b</math> (элементы <math>b_i</math>).
+
<source lang="fortran">
 +
DO  I = 1, P
 +
S (I) = A(K*(I-1)+1)*B(K*(I-1)+1)
 +
IF (I.LQ.P) THEN
 +
DO J = 2,K
 +
S(I)=S(I)+A(K*(I-1)+J)*B(K*(I-1)+J)
 +
            END DO
 +
ELSE
 +
DO J = 2,Q
 +
S(I)=S(I)+A(K*(I-1)+J)*B(K*(I-1)+J)
 +
            END DO
 +
END IF
 +
END DO
 +
SCP = S(1)
 +
DO I = 2, P
 +
SCP = SCP + S(I)
 +
END DO
 +
</source>
  
Дополнительные ограничения: отсутствуют.
+
Similar versions can be written with an inverted summation. It should be noted that the algorithm chart [[#Information graph|is the same for both versions]]! The initial cycle body as a whole can be replaced with a call to the scalar product function, if its serial version has been implemented.
  
Объём входных данных: <nowiki/><math>2 n</math>.
+
=== Possible methods and considerations for parallel implementation of the algorithm ===
  
Выходные данные: сумма попарных произведений элементов массивов.
+
In addition to [[#Implementation peculiarities of the sequential algorithm|the simplest implementation described above]], there are more complex codes for implementing the same algorithm. It should be noted that a number of the implementations (in the same BLAS) use the decomposition of <math>n</math> into a small and large number. In this case, nested cycles are not used, as the summation of a small number of multiplication operations is performed “manually” by expanding the initial cycle. Some implementations of the serial-parallel computation method are not written as individual subprograms, but scattered around the code producing scalar multiplication, despite effectively representing this kind of implementation. Examples of this approach include block implementations of various decompositions (Cholesky decomposition, Gaussian decomposition, etc.).
  
Объём выходных данных: один скаляр.
+
=== Run results ===
 +
=== Conclusions for different classes of computer architecture ===
  
=== Свойства алгоритма ===
+
== References ==
  
Соотношение последовательной и параллельной сложности в случае неограниченных ресурсов, как хорошо видно, является ''корнем квадратным'' (отношение линейной к корню квадратному). При этом вычислительная мощность алгоритма, как отношение числа операций к суммарному объему входных и выходных данных — всего-навсего ''1 (входных и выходных данных почти столько же, сколько операций; если точнее - даже больше на 2)''. При этом алгоритм полностью детерминирован при заданном разложении <math>n</math>. Дуги информационного графа локальны. Для уменьшения ошибок округления режимом накопления в ряде алгоритмов, использующих скалярное произведение одинарной точности, оно вычисляется с двойной точностью. Впрочем, у последовательно-параллельного способа вычисления скалярного произведения и без режима накопления влияние ошибок округления «в среднем» меньше в <math>\sqrt{n}</math> раз.
+
<references />
  
[[Category:Started articles]]
+
[[Category:Finished articles]]
  
 
[[Ru:Скалярное произведение векторов, вещественная версия, последовательно-параллельный вариант]]
 
[[Ru:Скалярное произведение векторов, вещественная версия, последовательно-параллельный вариант]]

Latest revision as of 14:49, 8 July 2022


Primary authors of this description: A.V.Frolov.

1 Properties and structure of the algorithm

1.1 General description of the algorithm

The dot product of vectors is one of the basic operations in a number of methods. It is used in two versions: as the proper dot product of [math]n[/math]-dimensional vectors (one-dimensional arrays of size [math]n[/math]) and as the scalar product of rows, columns, and other linear subsets of multidimensional arrays. The latter differs from the former in that the corresponding subroutine receives not only the initial addresses of vectors but also parameters of the displacement of the subsequent components with respect to the preceding ones (for the former version, all the displacements are equal to one). Formulas defining the dot product are different for real and complex arithmetic. Here, we deal only with real arithmetic and the serial-parallel implementation.

1.2 Mathematical description of the algorithm

Input data: two one-dimensional numeric arrays of size n.

Output data: the sum of pairwise products of the corresponding elements of both arrays.

Formulas of the method: the number [math]n[/math] is expressed in the form [math]n = (p - 1) k + q[/math], where [math]p[/math] is the number of processors, [math]k = \lceil \frac{n}{p} \rceil[/math], and [math]q = n - k (p - 1)[/math]. Then the [math]i[/math]-th processor ([math]i \lt p[/math]) calculates in the serial mode the partial dot product of the subarrays formed of the elements with indices from [math](i - 1) k + 1[/math] to [math]ik[/math].

[math]S_i = \sum_{j = 1}^k a_{k (i - 1) + j} b_{k (i - 1) + j}[/math]

The [math]p[/math]-th processor calculates in the serial mode the partial dot product of the subarrays formed of the elements with indices from [math](p - 1) k + 1[/math] to [math](p - 1) k + q[/math].

[math]S_p = \sum_{j = 1}^q a_{k (p - 1) + j} b_{k (p - 1) + j}[/math]

Upon ending this process, the processors exchange their data, and then one of them (or all of them simultaneously if they need the result later on) add the partial sums in the serial mode.

[math]\sum_{i = 1}^p S_i[/math]

In the serial-parallel version, all the sums are calculated in the serial mode (usually in the increasing order of indices).

1.3 Computational kernel of the algorithm

The computational kernel of the dot product in the serial-parallel version can be represented as [math]p[/math] calculations of partial dot products with the subsequent serial summation of [math]p[/math] partial results.

1.4 Macro structure of the algorithm

As already noted in the description of the kernel, the basic part of the algorithm consists of the parallel computation of partial dot products, calculated in the serial mode, and the subsequent summation of these partial dot products.

1.5 Implementation scheme of the serial algorithm

The formulas of the method are given above. The order of summation can be different, both in the increasing and decreasing order of indices. If no special reason is pursued, then, usually, the natural order of increasing indices is used.

1.6 Serial complexity of the algorithm

In any implementation of the dot product for arrays consisting of [math]n[/math] elements the number of scalar multiplications is invariably equal to [math]n[/math], while the number of additions is [math]n - 1[/math]. Consequently, in terms of the number of serial operations, the algorithm should be qualified as a linear complexity algorithm.

1.7 Information graph

We describe the graph of the algorithm graphically (see the first figure). It should be noted, however, that, in most cases, programmers prefer to assign zero initial values to variables chosen for summation, which increases the number of additions by one. In such cases, the graph is as shown in the second figure (n=24).

Figure 1. Serial-parallel calculation of the dot product with saving in the number of additions
Figure 2. Serial-parallel calculation of the dot product without saving in the number of additions

1.8 Parallelization resource of the algorithm

The parallel version of the serial-parallel method for calculating the dot product of arrays of size [math]n[/math] requires that the following layers be successively executed:

  • 1 layer of calculating pairwise products,
  • [math]k - 1[/math] layers of summation for partial dot products ([math]p[/math] branches),
  • [math]p - 1[/math] layers of summation (one serial branch).

Therefore, the critical path of the algorithm in the parallel version (and the corresponding height of the CPF) depends on the chosen partition of the arrays. In the optimal case ([math]p = \sqrt{n}[/math]), the height of the CPF is [math] 2 \sqrt{n} - 1[/math]. Thus, in terms of the height of the CPF, the serial-parallel method is qualified as a square-root complexity algorithm. In terms of the width of the CPF, it is also a square-root complexity algorithm.

1.9 Input and output data of the algorithm

Input data: array[math]a[/math] (with elements [math]a_i[/math]), array [math]b[/math] (with elements [math]b_i[/math]).

Further restrictions: none.

Size of the input data: [math]2 n[/math].

Output data: the sum of pairwise products of the corresponding elements of two arrays.

Size of the output data: single scalar.

1.10 Properties of the algorithm

It is clear that, in the case of unlimited resources, the ratio of the serial to parallel complexity is expressed by the square root of n (since this is the ratio of the linear to square-root complexity). The computational power of the algorithm, understood as the ratio of the number of operations to the total size of the input and output data, is only 1 (the size of the input and output data is almost equal to the number of operations; more exactly, the former number exceeds the latter by 2). For a given partition of [math]n[/math], the algorithm is completely determined. The edges of the information graph are local. In a number of algorithms using dot products of single precision and the accumulation mode, these products are calculated in double precision in order to reduce round-off errors. However, even without accumulation mode, the serial-parallel method for calculating the dot product reduces the effect of round-off errors [math]\sqrt{n}[/math]-fold "on the average".

2 Software implementation of the algorithm

2.1 Implementation peculiarities of the serial algorithm

The simplest variation (without a summation inversion) in Fortran can be written as follows:

	DO  I = 1, P
		S (I) = A(K*(I-1)+1)*B(K*(I-1)+1)
		IF (I.LQ.P) THEN
			DO J = 2,K
				S(I)=S(I)+A(K*(I-1)+J)*B(K*(I-1)+J)
		             END DO
		ELSE
			DO J = 2,Q
				S(I)=S(I)+A(K*(I-1)+J)*B(K*(I-1)+J)
		             END DO
		END IF
	END DO
	SCP = S(1)
	DO I = 2, P
		SCP = SCP + S(I)
	END DO

Similar versions can be written with an inverted summation. It should be noted that the algorithm chart is the same for both versions! The initial cycle body as a whole can be replaced with a call to the scalar product function, if its serial version has been implemented.

2.2 Possible methods and considerations for parallel implementation of the algorithm

In addition to the simplest implementation described above, there are more complex codes for implementing the same algorithm. It should be noted that a number of the implementations (in the same BLAS) use the decomposition of [math]n[/math] into a small and large number. In this case, nested cycles are not used, as the summation of a small number of multiplication operations is performed “manually” by expanding the initial cycle. Some implementations of the serial-parallel computation method are not written as individual subprograms, but scattered around the code producing scalar multiplication, despite effectively representing this kind of implementation. Examples of this approach include block implementations of various decompositions (Cholesky decomposition, Gaussian decomposition, etc.).

2.3 Run results

2.4 Conclusions for different classes of computer architecture

3 References