Difference between revisions of "Single-qubit transform of a state vector"

[unchecked revision]

[checked revision]

Latest revision as of 15:56, 19 July 2022

Single-qubit transform of a state vector
Sequential algorithm
Serial complexity	[math]3 \cdot 2^n[/math]
Input data	[math]2^n+4[/math]
Output data	[math]2^n[/math]
Parallel algorithm
Parallel form height	[math]2[/math]
Parallel form width	[math]2^{n+1}[/math]

Primary authors of this description: A.Yu.Chernyavskiy.

1 Properties and structure of the algorithm

1.1 General description of the algorithm

The algorithm simulates an action of a single-qubit quantum gate on a state vector. ^[1] ^[2] ^[3]

The operations with small number of qubits (quantum bits) are the basis of quantum computations. Moreover, it's known, that arbitrary quantum computation can be represented by only single- and two-qubit operations. Such operations are the analogues of simple classical logic operations (like NOT, AND, XOR, etc), but have some serious differences, which lead to the simulation difficulties. The main difference is in the fact that even single-qubit operation affects all other qubits due to the phenomena of quantum entanglement. Such "non-locality" gives the huge computational power to the quantum computers, but makes it very difficult to simulate them.

The algorithm of single-qubit transform simulation is often a subroutine and is used iteratively to the different qubits of a state. For example, it's used for the quantum algorithms simulation and the analysis of quantum entanglement. As for the most of the classical algorithms in quantum information science, the described algorithm shows the exponential data growth with respect to the number of qubits. This fact leads to the necessity of a supercomputer usage for the practically important tasks.

1.2 Mathematical description of the algorithm

Input data:

Integer parameters [math]n[/math] (the number of qubits) and [math]k[/math] (the number of the affected qubit).

A complex [math]2 \times 2[/math] matrix [math]U = \begin{pmatrix} u_{00} & u_{01}\\ u_{10} & u_{11} \end{pmatrix}[/math] of a single-qubit transform.

A complex [math]2^n[/math]-dimensional vector [math]v,[/math] which denotes the initial state of a quantum system.

Output data: the complex [math]2^n[/math]-dimensional vector [math]w,[/math] which corresponds to the quantum state after the transformation.

The single-qubit transform can be represented as follows:

The output state after the action of a transform [math]U[/math] on the [math]k-[/math]th qubit is [math]v_{out} = I_{2^{k-1}}\otimes U \otimes I_{2^{n-k}},[/math] where [math]I_{j}[/math] is the [math]j-[/math]dimensional identity matrix, and [math]\otimes[/math] denotes the tensor (Kronecker) product.

However, the elements of the output vector can be represented in more computationally useful direct form:

[math] w_{i_1i_2\ldots i_k \ldots i_n} = \sum\limits_{j_k=0}^1 u_{i_k j} v_{i_1i_2\ldots j_k \ldots i_n} = u_{i_k 0} v_{i_1i_2\ldots 0_k \ldots i_n} + u_{i_k 1} v_{i_1i_2\ldots 1_k \ldots i_n} [/math]

A tuple-index [math]i_1i_2\ldots i_n[/math] denotes the binary form of an array index.

1.3 The computational kernel of the algorithm

The computational kernel of the single-qubit transform can be composed of [math]2^n[/math] calculations of the elements of the output vector [math]w.[/math] The computation of each element consists of two complex multiplication and one complex addition opearations. Moreover we need to compute an indexes [math]i_1i_2\ldots 0_k \ldots i_n,[/math] and the [math]i_k[/math] bits, these computations needs the bitwise operations.

1.4 Macro structure of the algorithm

As noted in the description of the computational kernel, the main part of the algorithm consists of the independent computations of the output vector elements.

1.5 Implementation scheme of the serial algorithm

For [math]i[/math] from [math]0[/math] to [math]2^n-1[/math]

Calculate the element [math]i_k[/math] of the binary representation of the index [math]i.[/math]
Calculate indexes [math]j[/math] with binary representation [math]i_1i_2\ldots \overline{i_k} \ldots i_n,[/math] where the overline denotes the bit-flip.
Calculate [math]w_i = u_{i_k i_k}\cdot v_{i} + u_{i_k \overline{i_k}}\cdot v_j.[/math]

1.6 Serial complexity of the algorithm

The following number of operations is needed for the single-qubit transform:

[math]2^{n+1}[/math] complex multiplications;
[math]2^n[/math] complex additions;
[math]2^n[/math] [math]k[/math]-th bit reads;
[math]2^n[/math] [math]k[/math]-th bit writes.

Note, that the algorithm are often used iteratively, that's the the bit operations (3-4) can be performed once in the beginning of computations. Moreover, they can be avoided by the usage of additions and bitwise multiplications with the [math]2^k,[/math] that can be also calculated once for each [math]k.[/math]

1.7 Information graph

The Figs. 1 and 2 shows the information graph for the paramters [math]n=3, k=1[/math] and [math]n=3, k=2.[/math] The graphs are presented without a transformation matrix [math]U,[/math] because its small size in comparison with a input and output data. Fig.3 shows the main operation, which is denoted by orange on Figs.1-2.

The representation with an input and data and the "folded" triple operation is useful for the understanding of the memory locality. Note, the the structure of an information graph strongly depends on the parameter [math]k.[/math]

Figure 1. The information graph for [math]n=3, k=1[/math] without transformation matrix [math]U.[/math]

Figure 2. The information graph for [math]n=3, k=2[/math] without transformation matrix [math]U.[/math]

Figure 3. The information graph of the main operation.

1.8 Parallelization resource of the algorithm

As we can see on the information graph, the direct algorithm of one-qubit transform simulation has the very high parallelization resource, because all computations of different output vector elements can be made independently. Two multiplication operations needed for the computation of each output vector element also can be done in parallel.

So, there are only two levels of computations:

[math]2^{n+1}[/math] multiplications.
[math]2^n[/math] additions.

1.9 Input and output data of the algorithm

Input data:

[math]2^n[/math]-dimensional complex state vector [math]u.[/math] In most cases thу vector is normalized to 1.
[math]2[/math]-dimensional unitary matrix [math]U[/math].
The number (index) of the affected qubit [math]q[/math].

Output data:

The [math]2^n[/math]-dimensional complex output state vector [math]w.[/math].

Amount of input data: [math]2^n+4[/math] complex numbers and [math]1[/math] integer parameter.

Amount of output data: [math]2^n[/math] complex numbers.

1.10 Properties of the algorithm

The ratio of the serial complexity to the parallel complexity is exponential (an exponent transforms to a constant).

The computational power of the algorithm considered as the ratio of the number of operations to the amount of input and output data is constant.

2 Software implementation of the algorithm

2.1 Implementation peculiarities of the serial algorithm

The function of the single-qubit transform may be represented in C language as:

void OneQubitEvolution(complexd *in, complexd *out, complexd U[2][2], int nqubits, int q)
{
	//n - number of qubits
        //q - index of the affected qubit

	int shift = nqubits-q;
	//All bits are zeros, except the bit correspondent to the affected qubit.  
	int pow2q=1<<(shift);

	int N=1<<nqubits;
	for	(int i=0; i<N; i++)
	{
		//Making the changing bit zero
		int i0 = i & ~pow2q;
		
		//Setting the changing bit value
		int i1 = i | pow2q;
		
		//Getting the changing bit value
		int iq = (i & pow2q) >> shift;

		out[i] = U[iq][0] * in[i0] + U[iq][1] *in[i1];
	}
}

Note, that the huge part of the computation and code logic consists of the bit operations, but this can be avoided. In most cases, single-qubit transform is just as a subroutine and is used sequentially many times. Moreover, indexes i0, i1, and iq depends only on the parameters n and q. The n is fixed, and we can calculate these indexes in the beginning of computations for every parameter q, and we need only 3n integers to store the data, when the input and output data growths exponentially. It's obvious that such optimization is critically needed for the implementations of the algorithm for the hardware and software with slow bit operations (for example, Matlab).

2.2 Possible methods and considerations for parallel implementation of the algorithm

The main method of the parallelization is obvious: we need to parallelize the main loop (independent computations of the elements of the output vector), and possibly parallelize in-loop multiplications, which are independent too, as can bee seen from Fig.3. For shared memory systems such method leads to the near ideal speed-up. However, for the distributed memory machines the algorithm has the unsolvable problems with data transfer, that can be seen (for example, from Fig. 8), for big tasks (with the big number of qubits n), the the amount of transfer operations becomes comparable to the arithmetical operation. The possible particular solution is the usage of caching or different programming paradigms like SHMEM. But such unlocality makes it impossible to achieve ideal efficiency on distributed memory machines.

The algorithm is implemented (mostly serial versions) in different libraries for quantum information science and quantum computer simulation. For example: QLib, libquantum, QuantumPlayground, LIQUiD.

2.3 Run results

2.4 Conclusions for different classes of computer architecture

Because of the very high parallelization possibility, but low locality, the effective and good scalable parallel implementation of the algorithm can be achieved for the shared memory. But the usage of distributed memory leads to the efficiency problems and necessity of special tricks for the optimization. It may be noted, that this task is a very good test platform for the development of methods for the data intense tasks with bad locality.

3 References

↑ Kronberg D. A., Ozhigov Yu. I., Chernyavskiy A. Yu. Algebraic tools for quantum information. (In Russian)
↑ Preskill J. Lecture notes for physics 229: Quantum information and computation //California Institute of Technology. – 1998.
↑ Nielsen, Michael A., and Isaac L. Chuang. Quantum computation and quantum information. Cambridge university press, 2010.

[1] Kronberg D. A., Ozhigov Yu. I., Chernyavskiy A. Yu. Algebraic tools for quantum information. (In Russian)

[2] Preskill J. Lecture notes for physics 229: Quantum information and computation //California Institute of Technology. – 1998.

[3] Nielsen, Michael A., and Isaac L. Chuang. Quantum computation and quantum information. Cambridge university press, 2010.

[1]

[2]

[3]

Difference between revisions of "Single-qubit transform of a state vector"

Latest revision as of 15:56, 19 July 2022

Contents

1 Properties and structure of the algorithm

1.1 General description of the algorithm

1.2 Mathematical description of the algorithm

1.3 The computational kernel of the algorithm

1.4 Macro structure of the algorithm

1.5 Implementation scheme of the serial algorithm

1.6 Serial complexity of the algorithm

1.7 Information graph

1.8 Parallelization resource of the algorithm

1.9 Input and output data of the algorithm

1.10 Properties of the algorithm

2 Software implementation of the algorithm

2.1 Implementation peculiarities of the serial algorithm

2.2 Possible methods and considerations for parallel implementation of the algorithm

2.3 Run results

2.4 Conclusions for different classes of computer architecture

3 References

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

File storage

Tools

In other languages

@@ Line 1: / Line 1: @@
 {{algorithm
-| name              = Однокубитное преобразование <br /> вектора-состояния
+| name              = Single-qubit transform <br /> of a state vector
 | serial_complexity = <math>3 \cdot 2^n</math>
 | pf_height         = <math>2</math>
@@ Line 8: / Line 8: @@
 }}
-Основные авторы описания: [[Участник:Chernyavskiy|А.Ю.Чернявский]], [[Участник:VadimVV|Вад.В.Воеводин]] ([[#Описание локальности данных и вычислений|раздел 2.2]])
+Primary authors of this description: [[:ru:Участник:Chernyavskiy|A.Yu.Chernyavskiy]].
 == Properties and structure of the algorithm ==
 === General description of the algorithm ===
 ''The algorithm simulates an action of a single-qubit quantum gate on a state vector.''
@@ Line 20: / Line 19: @@
 <ref>Nielsen, Michael A., and Isaac L. Chuang. Quantum computation and quantum information. Cambridge university press, 2010.</ref>
-This algorithm is often a subroutine and is used iteratively to the different qubits of a state. For example, it's used for the quantum algorithms simulation and the analysis of quantum entanglement. As for the most of the classical algorithms in quantum information science, the described algorithm shows the exponential data growth with respect to the number of qubits. This fact leads the necessity of a supercomputer usage for the practically important tasks.
+The operations with small number of qubits (quantum bits) are the basis of quantum computations. Moreover, it's known, that arbitrary quantum computation can be represented by only single- and two-qubit operations. Such operations are the analogues of simple classical logic operations (like NOT, AND, XOR, etc), but have some serious differences, which lead to the simulation difficulties. The main difference is in the fact that even single-qubit operation affects all other qubits due to the phenomena of quantum entanglement. Such "non-locality" gives the huge computational power to the quantum computers, but makes it very difficult to simulate them.
+The algorithm of single-qubit transform simulation is often a subroutine and is used iteratively to the different qubits of a state. For example, it's used for the quantum algorithms simulation and the analysis of quantum entanglement. As for the most of the classical algorithms in quantum information science, the described algorithm shows the exponential data growth with respect to the number of qubits. This fact leads to the necessity of a supercomputer usage for the practically important tasks.
 === Mathematical description of the algorithm ===
@@ Line 56: / Line 56: @@
 The computational kernel of the single-qubit transform can be composed of <math>2^n</math> calculations of the elements of the output vector <math>w.</math> The computation of each element consists of two complex multiplication and one complex addition opearations. Moreover we need to compute an indexes <math>i_1i_2\ldots 0_k \ldots i_n,</math> and the <math>i_k</math> bits, these computations needs the bitwise operations.
-=== Макроструктура алгоритма ===
+=== Macro structure of the algorithm ===
 As noted in [[#Computational kernel of the algorithm|the description of the computational kernel,]] the main part of the algorithm consists of the independent computations of the output vector elements.
@@ Line 78: / Line 78: @@
 The representation with an input and data and the "folded" triple operation is useful for the understanding of the memory locality.
 Note, the the structure of an information graph strongly depends on the parameter <math>k.</math>
-[[file:OneQubitVectorTransform1.png|thumb|center|800px|Рисунок 1. Граф алгоритма для <math>n=3, k=1</math> без отображения матрицы преобразования <math>U.</math> ]]
+[[file:OneQubitVectorTransform1.png|thumb|center|800px|Figure 1. The information graph for <math>n=3, k=1</math> without transformation matrix <math>U.</math> ]]
-[[file:OneQubitVectorTransform.png|thumb|center|800px|Рисунок 2. Граф алгоритма для <math>n=3, k=2</math> без отображения матрицы преобразования <math>U.</math> ]]
+[[file:OneQubitVectorTransform.png|thumb|center|800px|Figure 2. The information graph for <math>n=3, k=2</math> without transformation matrix <math>U.</math> ]]
-[[file:q_operation.png|thumb|center|300px|Рисунок 3. Граф основной операции.]]
+[[file:q_operation.png|thumb|center|300px|Figure 3. The information graph of the main operation.]]
 === Parallelization resource of the algorithm ===
@@ Line 92: / Line 92: @@
 '''Input data:'''
-* A <math>2^n</math>-dimensional complex state vector <math>u.</math> In most cases thу vector is normalized to 1.
+* <math>2^n</math>-dimensional complex state vector <math>u.</math> In most cases thу vector is normalized to 1.
-* A <math>2</math>-dimensional unitary matrix <math>U</math>>.
+* <math>2</math>-dimensional unitary matrix <math>U</math>.
 * The number (index) of the affected qubit <math>q</math>.
@@ Line 103: / Line 103: @@
 '''Amount of output data:''' <math>2^n</math> complex numbers.
-=== Свойства алгоритма ===
+=== Properties of the algorithm ===
-Соотношение последовательной и параллельной сложности является ''экспоненциальным'' (эксмпонента переходит в константу).
-Вычислительная мощность алгоритма, как отношение числа операций к суммарному объему входных и выходных данных – ''константа''.
+The ratio of the serial complexity to the parallel complexity is ''exponential'' (an exponent transforms to a constant).
-== Программная реализация алгоритма ==
+The computational power of the algorithm considered as the ratio of the number of operations to the amount of input and output data is ''constant''.
-=== Особенности реализации последовательного алгоритма ===
-На языке C функцию однокубитного преобразования можно записать следующим образом:
+== Software implementation of the algorithm ==
+=== Implementation peculiarities of the serial algorithm ===
+The function of the single-qubit transform may be represented in C language as:
 <source lang="C">
 void OneQubitEvolution(complexd *in, complexd *out, complexd U[2][2], int nqubits, int q)
 {
-	//n - число кубитов
+	//n - number of qubits
-         //q - номер кубита для преобразования
+         //q - index of the affected qubit
 	int shift = nqubits-q;
-	//Все биты нулевые, кроме соответствующего позиции преобразуемого кубита
+	//All bits are zeros, except the bit correspondent to the affected qubit.
 	int pow2q=1<<(shift);
@@ Line 125: / Line 126: @@
 	for	(int i=0; i<N; i++)
 	{
-		//Обнуления меняющегося бита
+		//Making the changing bit zero
 		int i0 = i & ~pow2q;
-		//Установка меняющегося бита
+		//Setting the changing bit value
 		int i1 = i | pow2q;
-		//Получение значения меняющегося бита
+		//Getting the changing bit value
 		int iq = (i & pow2q) >> shift;
@@ Line 139: / Line 140: @@
 </source>
-Отметим, что существенная часть вычислений и логики кода приходится на битовые операции, однако, этого можно избежать: однокубитное преобразование в большинстве случаев является лишь подпрограммой и применяется к разным кубитам большое число раз. В свою очередь, вычисляемые при помощи битовых операций индексы i0, i1 и iq зависят лишь от параметров количества кубитов n и номера кубита q. Число кубитов обычно фиксировано, соответственно, можно вычислить эти индексы заранее для всех q от 1 до n. Для хранения потребуется лишь массив целочисленных переменных линейного размера 3n в то время, как обрабатываемые данные имеют экспоненциальный размер. Очевидно, что такая оптимизация критически необходима при реализации алгоритма на вычислительных устройствах или языках программирования с отсутствием быстрых битовых операций (примером может служить среда Matlab).
+Note, that the huge part of the computation and code logic consists of the bit operations, but this can be avoided. In most cases, single-qubit transform is just as a subroutine and is used sequentially many times. Moreover, indexes i0, i1, and iq depends only on the parameters n and q. The n is fixed, and we can calculate these indexes in the beginning of computations for every parameter q, and we need only 3n integers to store the data, when the input and output data growths exponentially. It's obvious that such optimization is critically needed for the implementations of the algorithm for the hardware and software with slow bit operations (for example, Matlab).
-=== Локальность данных и вычислений ===
+=== Possible methods and considerations for parallel implementation of the algorithm ===
-К сожалению, в противовес идеальной возможности параллелизации алгоритма, практические реализации обладают очень плохой локальностью.
-Из математического описания и информационных графов для разных параметров q можно заметить, что при однократном применении однокубитного преобразования легко получить идеальную локальность обращения к данным простой перестановкой кубитов. Так, переместив кубит q на последнее место, мы получим взаимодействие лишь соседних по памяти элементов, причем в идеальном последовательном доступе.
+The main method of the parallelization is obvious: we need to parallelize the main loop (independent computations of the elements of the output vector), and possibly parallelize in-loop multiplications, which are independent too, as can bee seen from Fig.3. For shared memory systems such method leads to the near ideal speed-up. However, for the distributed memory machines the algorithm has the unsolvable problems with data transfer, that can be seen (for example, from Fig. 8), for big tasks (with the big number of qubits n), the the amount of transfer operations becomes comparable to the arithmetical operation. The possible particular solution is the usage of caching or different programming paradigms like SHMEM. But such unlocality makes it impossible to achieve ideal efficiency on distributed memory machines.
-Однако, преобразование одного кубита в прикладных задачах является лишь подпрограммой и применяется многократно с различными параметрами q. Как видно из математического описания и Рис. 1-2, это полностью исключает возможность добиться локальности обращений к данным.
+The algorithm is implemented (mostly serial versions) in different libraries for quantum information science and quantum computer simulation. For example: QLib, libquantum, QuantumPlayground, LIQUiD.
-==== Локальность реализации алгоритма ====
+=== Run results ===
-===== Структура обращений в память и качественная оценка локальности =====
+=== Conclusions for different classes of computer architecture ===
-[[file:qubit_1.png|thumb|center|700px|Рисунок 1. Однокубитное преобразование вектора-состояния. Общий профиль обращений в память]]
+Because of the very high parallelization possibility, but low locality, the effective and good scalable parallel implementation of the algorithm can be achieved for the shared memory. But the usage of distributed memory leads to the efficiency problems and necessity of special tricks for the optimization. It may be noted, that this task is a very good test platform for the development of methods for the data intense tasks with bad locality.
-На рис. 1 представлен профиль обращений в память для вычисления однокубитного преобразования вектора-состояния. Данный профиль состоит из обращений к трем массивам, фрагменты для отдельных массивов выделены на рис. 1 зеленым цветом. Из общего профиля можно увидеть, что обращения редко используются повторно, по крайней мере в случае фрагментов 2 и 3. При этом обращения к близко расположенным друг к другу данным выполняются рядом. Рассмотрим выделенные фрагменты поближе.
+== References ==
-Отдельно фрагмент 1 представлен на рис. 2. Видно, что данный массив состоит всего из 4-х элементов, к которым постоянно выполняются обращения. Такой фрагмент обладает очень высокой локальностью, поскольку постоянно используются ранее запрошенные данные.
+<references />
-Далее, рассмотрим фрагмент 2 (рис. 3). Здесь все еще проще – выполняется обычный последовательный перебор всех элементов массива. Такой фрагмент обладает высокой пространственной локальностью, однако очень низкой временной (данные не используются повторно).
-[[file:qubit_2.png|thumb|center|500px|Рисунок 2. Фрагмент 1 (профиль обращений к первому массиву)]]
+[[Ru:Однокубитное преобразование вектора-состояния]]
-[[file:qubit_3.png|thumb|center|500px|Рисунок 3. Фрагмент 2 (профиль обращений ко второму массиву)]]
+[[Category:Finished articles]]
-Наиболее интересным представляется фрагмент 3. Его небольшой фрагмент, выделенный на рис. 1 желтым, представлен на рис. 4. Однако при ближайшем рассмотрении оказывается, что данный фрагмент тоже просто устроен, хотя и немного сложнее предыдущих.
-В данном случае также виден в центре последовательный перебор всех элементов массива, параллельно с которым выполняются обращения либо к элементам с большим или меньшим виртуальным адресом. Отметим, однако, что эта разница между виртуальными адресами, судя по всему, больше 64 байт (длины строки), что может служить причиной возникновения большого числа кэш-промахов.
-[[file:qubit_4.png|thumb|center|500px|Рисунок 4. Небольшая часть фрагмента 3, выделенная на рис. 1 желтым]]
-В общем можно сказать, что общий профиль обращений в память обладает достаточно высокой пространственной локальностью, поскольку большинство обращений образуют последовательные переборы элементов массивов, однако временная локальность низка – данные практически не используются повторно.
-===== Количественная оценка локальности =====
-Основной фрагмент реализации, на основе которого были получены количественные оценки, приведен [http://git.parallel.ru/shvets.pavel.srcc/locality/blob/master/benchmarks/qubit_evolution/qubit_evolution.h здесь] (функция Kernel). Условия запуска описаны [http://git.parallel.ru/shvets.pavel.srcc/locality/blob/master/README.md здесь].
-Первая оценка выполняется на основе характеристики daps, которая оценивает число выполненных обращений (чтений и записей) в память в секунду. Данная характеристика является аналогом оценки flops применительно к работе с памятью и является в большей степени оценкой производительности взаимодействия с памятью, чем оценкой локальности. Однако она служит хорошим источником информации, в том числе для сравнения с результатами по следующей характеристике cvg.
-На рисунке 5 приведены значения daps для реализаций распространенных алгоритмов, отсортированные по возрастанию (чем больше daps, тем в общем случае выше производительность). Можно увидеть, что производительность работы с памятью для этой программы высока – значение daps примерно на уровне теста Linpack. Видимо, низкая временная локальность в данном случае компенсируется высокой пространственной локальностью.
-[[file:qubit_daps.png|thumb|center|700px|Рисунок 5. Сравнение значений оценки daps]]
-Вторая характеристика – cvg – предназначена для получения более машинно-независимой оценки локальности. Она определяет, насколько часто в программе необходимо подтягивать данные в кэш-память. Соответственно, чем меньше значение cvg, тем реже это нужно делать, тем лучше локальность.
-На рисунке 6 приведены значения cvg для того же набора реализаций, отсортированные по убыванию (чем меньше cvg, тем в общем случае выше локальность). Можно увидеть, что, в отличие от daps, cvg оценивает локальность данной программы как достаточно низкую. В частности, значение cvg для Linpack заметно меньше, в то время как значения daps практически совпадали.
-Одна из возможных причин этого – влияние арифметических операций. Может получиться, что данные из памяти не будут запрашиваться, пока арифметические операции не будут выполнены; это приводит к простою подсистемы памяти. Соответственно, если в одной программе таких операций нет, а в другой - есть, то daps в первом случае будет выше. При этом cvg не поменяется, поскольку эта оценка не зависит от времени выполнения.
-В данном случае арифметических операций практически нет (в отличие от некоторых других программ), поэтому daps может показывать более высокие результаты, в то время как cvg показывает достаточно низкую оценку.
-[[file:qubit_cvg.png|thumb|center|700px|Рисунок 6. Сравнение значений оценки cvg]]
-=== Возможные способы и особенности параллельной реализации алгоритма ===
-Основной способ параллельной реализации очевиден - необходимо распараллеливание основного цикла (параллельное вычисление различных компонент выходного вектора-состояния) и, желательно, операций умножения. На машинах с общей памятью такой вариант распараллеливания приводит к ускорению, близкому к максимально-возможному.
-Однако, данный способ сталкивается с проблемами локальности данных. Анализируя математическая описание и информационные графы алгоритма легко видеть, что при использовании большого числа узлов с собственной памятью количество необходимых пересылок между узлами становится сопоставимым с количеством вычислений, что приводит к существенной потере эффективности. Возможны разные пути частичного решения этой проблемы, например, кэширование или использование парадигмы программирования SHMEM, однако, столь сильная нелокальность использования данных всё равно не позволяет добиться хорошей эффективности.
-=== Масштабируемость алгоритма и его реализации ===
-==== Масштабируемость алгоритма ====
-==== Масштабируемость реализации алгоритма ====
-В связи с констатной высотой ярусно-параллельной формы алгоритм имеет отличную масштабируемость на машинах с общей памятью. При реализации на машинах с разделяемой памятью быстро возникают проблемы, связанные с локальностью.
-Проведём исследование масштабируемости параллельной реализации алгоритма согласно [[Scalability methodology|методике]]. Исследование проводилось на суперкомпьютере "Ломоносов"<ref name="Lom">Воеводин Вл., Жуматий С., Соболев С., Антонов А., Брызгалов П., Никитенко Д., Стефанов К., Воеводин Вад. Практика суперкомпьютера «Ломоносов» // Открытые системы, 2012, N 7, С. 36-39.</ref> [http://parallel.ru/cluster Суперкомпьютерного комплекса Московского университета].
-Набор и границы значений изменяемых [[Глоссарий#Параметры запуска|параметров запуска]] реализации алгоритма:
-*	число процессоров [4 : 512] со значениями степени двойки;
-*	число кубитов [16;32] с шагом 2.
-В результате проведённых экспериментов был получен следующий диапазон [[Глоссарий#Эффективность реализации|эффективности реализации]] алгоритма:
-*	минимальная эффективность реализации 0.000003%;
-*	максимальная эффективность реализации 0.008%.
-На следующих рисунках приведены графики [[Глоссарий#Производительность|производительности]] и эффективности выбранной реализации алгоритма в зависимости от изменяемых параметров запуска.
-[[file:Qbit perf.png|thumb|center|700px|Рисунок 7. Параллельная реализация алгоритма. Изменение производительности в зависимости от числа процессоров и числа кубитов.]]
-[[file:Qbit eff.png|thumb|center|700px|Рисунок 8. Параллельная реализация алгоритма. Изменение эффективности в зависимости от числа процессоров и числа кубитов.]]
-По результатам исследования можно сделать вывод о том, что эффективность алгоритма не является высокой, однако достаточно быстро возрастает с ростом размера задачи. Сложность задачи возрастает экспоненциально и потому имеет смысл решать задачу максимального размера, влезающую в оперативную память, чтобы увеличить общую эффективность вычислений.
-Исследованная [http://git.algowiki-project.org/Teplov/Scalability/blob/master/qbit/main.cpp параллельная реализация алгоритма]
-=== Динамические характеристики и эффективность реализации алгоритма ===
-Для проведения экспериментов использовалась реализация алгоритма однокубитного преобразования, в реализации, доступной [http://git.algowiki-project.org/Teplov/Scalability/blob/master/qbit/main.cpp здесь].  Все результаты получены на суперкомпьютере "ГрафИТ!". Использовались процессоры Intel Xeon X5570 с пиковой производительностью в 94 Гфлопс, а также компилятор Gnu 4.4.7.
-На рисунках показана эффективность реализации алгоритма встречной прогонки.
-[[file:Qbit CPU User.png|thumb|center|700px|Рисунок 9. График загрузки CPU при выполнении алгоритма однокубитного преобразования]]
-На графике загрузки процессора видно, что почти все время работы программы уровень загрузки составляет около 50% в среднем. Это указывает на высокую нагрузку выполнения алгоритма при котором на 8-ми ядерном процессоре работают 8 процессов, без использования технологии HyperTreading. Это означает, что использование всех физических ядер близко к 100%
-[[file:Qbit Flops.png|thumb|center|700px|Рисунок 10. График операций с плавающей точкой в секунду при выполнении алгоритма однокубитного преобразования]]
-На Рисунке 11 показан график количества операций с плавающей точкой в секунду. На графике видна общая достаточно высокая производительность вычислений, достигающая в пике 300 млн операций в секунду и около 70 млн операций в секунду в среднем по всем узлам. Это может указывать на некоторый дисбаланс в вычислениях, при общем сравнительно высоком уровне производительности.
- [[file:Qbit L1.png|thumb|center|700px|Рисунок 11. График кэш-промахов L1 в секунду при работе алгоритма однокубитного преобразования]]
-На графике кэш-промахов первого уровня видно, что число промахов достаточно высокое и находится на уровне 4 млн/сек для отдельных процессов и 1 млн/сек в среднем по всем процессам. Это указывает на достаточно высокую интенсивность вычислений и хорошую локальность данных, учитывая высокие показатели операций с плавающей точкой и на дисбаланс вычислений между процессами.
- [[file:Qbit L3.png|thumb|center|700px|Рисунок 12. График кэш-промахов L3 в секунду при работе алгоритма однокубитного преобразования]]
-На графике кэш-промахов третьего уровня видно, что число промахов все еще достаточно высокое для параллельного приложения и находится на уровне 1 млн/сек в пике и около 250 тыс/сек в среднем по всем узлам. Отношение промахов L1/L3 около 4, что является средним показателем по таким задачам. Это также указывает на достаточно плохую локальность вычислений.
-[[file:Qbit Mem Load.png|thumb|center|700px|Рисунок 13. График количества чтений из оперативной памяти при работе алгоритма однокубитного преобразования]]
-На графике чтений из памяти на наблюдается достаточно невысокая активность чтения из памяти процессами. Так как активно работает сразу большое число процессов, то и активность чтения различная между максимальным и средним значением указывает на дизбалас вычислений в прграмме. Активность немного ниже типичной для приложений такого класса.
- [[file:Qbit Mem Store.png|thumb|center|700px|Рисунок 14. График количества записей в оперативную память при работе алгоритма однокубитного преобразования]]
-Активность записи в память достаточно низкая, это вполне ожидаемо для такого алгоритма, и так же указывает на достаточно высокую локальность вычислений.
- [[file:Qbit Load Avg.png|thumb|center|700px|Рисунок 15. График числа процессов, ожидающих вхождения в стадию счета (Loadavg), при работе алгоритма однокубитного преобразования]]
-На графике числа процессов, ожидающих вхождения в стадию счета (Loadavg), видно, что на протяжении всей работы программы значение этого параметра постоянно и приблизительно равняется 3. Это свидетельствует о стабильной работе программы с работающим малым числом ядер на каждом вычислительном узле. Это подтверждает предположение о разбалансированности вычислений, потому как часть процессов не проводит активных вычислений.
-В целом, по данным системного мониторинга работы программы можно сделать вывод о том, что программа работала не очень эффективно, но с высоким дисбалансом нагрузки. Приложение имеет достаточно высокий ресурс для оптимизаций в ключе более рационального распределения нагрузки на вычислители.
-=== Выводы для классов архитектур ===
-Исходя из высочайшей возможности параллелизации и при этом наличия существенной нелокальности обращения к данным, эффективная и хорошо масштабируемая параллельная реализация алгоритма однокубитного квантового преобразования легко достижима на машинах с общей памятью. Реализация же на машинах с разделяемой памятью имеет низкую эффективность и требует специальных подходов для уменьшения времени работы. Можно отметить, что данная задача является хорошим плацдармом для разработки методов решения задач с интенсивным использованием данных и низкой локальностью.
-=== Существующие реализации алгоритма ===
-== Литература ==
-<references />