7.3.1: Maqao's opinion



Let's ask Maqao for its opinion:
maqao cqa --fct-loops=grayscott_propagation ./GrayScottCompute/Vectorized/libgray_scott_vectorized_3x3.so 
Target processor is: 12th generation Intel Core processors and Intel Xeon processor product family based on Alder Lake microarchitecture (x86_64 architecture).

Section 1: Function: grayscott_propagation(float*, float*, float const*, float const*, long, long, float const*, float, float, float, float, float)
===================================================================================================================================================

Code for this function has been specialized for Alder Lake, Raptor Lake. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).
These loops are supposed to be defined in: XXX/Examples/GrayScottCompute/Vectorized/vectorized_propagation_3x3.cpp

Section 1.1: Source loop ending at line 95
==========================================

Composition and unrolling
-------------------------
It is composed of the following loops [ID (first-last source line)]:
 - 0 (59-95)
 - 2 (59-95)
and is unrolled by 8 (including vectorization).

The following loops are considered as:
 - unrolled and/or vectorized main: 2
 - peel or tail: 0
The analysis will be displayed for the unrolled and/or vectorized loops: 2

Section 1.1.1: Binary (unrolled and/or vectorized) loop #2
==========================================================

The loop is defined in XXX/Examples/GrayScottCompute/Vectorized/vectorized_propagation_3x3.cpp:59-95.

It is main loop of related source loop which is unrolled by 8 (including vectorization).
100% of peak computational performance is used (32.00 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Vectorization
-------------
Your loop is fully vectorized, using full register length.

All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).


Execution units bottlenecks
---------------------------
Performance is limited by:
 - execution of FP add operations (the FP add unit is a bottleneck)
 - execution of FP multiply or FMA (fused multiply-add) operations (the FP multiply/FMA unit is a bottleneck)

By removing all these bottlenecks, you can lower the cost of an iteration from 15.50 to 12.33 cycles (1.26x speedup).

Workaround(s):
 - Reduce the number of FP add instructions
 - Reduce the number of FP multiply/FMA instructions


FMA
---
Detected 160 FMA (fused multiply-add) operations.
Presence of both ADD/SUB and MUL operations.
Workaround(s):
Try to change order in which elements are evaluated (using parentheses) in arithmetic expressions containing both ADD/SUB and MUL operations to enable your compiler to generate FMA instructions wherever possible.
For instance a + b*c is a valid FMA (MUL then ADD).
However (a+b)* c cannot be translated into an FMA (ADD then MUL).


All innermost loops were analyzed.

Info: Rerun CQA with conf=hint,expert to display more advanced reports or conf=all to display them with default reports.


Clearly, the compiler did its job correctly.
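The two headline figures of this report can be recomputed by hand. The sketch below assumes that the 32 FLOP per cycle peak comes from two 256-bit FMA units, each processing 8 single-precision lanes at 2 FLOP per lane; that breakdown is an assumption about how the figure is built, not something the report states. The function names are hypothetical.

```cpp
// Hypothetical helpers: recompute two figures from the CQA report above.

// Peak single-precision FLOP per cycle, assuming `fma_units` vector FMA units,
// `lanes` floats per register and 2 FLOP (multiply + add) per FMA lane.
constexpr double peak_flop_per_cycle(int fma_units, int lanes) {
    return static_cast<double>(fma_units) * lanes * 2.0;
}

// Speedup obtained when one iteration drops from `before` to `after` cycles.
constexpr double speedup(double before, double after) {
    return before / after;
}

// Assumed breakdown: 2 FMA units x 8 floats (ymm registers) x 2 FLOP = 32.
static_assert(peak_flop_per_cycle(2, 8) == 32.0, "matches the reported peak");

// 15.50 / 12.33 is about 1.257, which the report rounds to 1.26x.
static_assert(speedup(15.50, 12.33) > 1.25 && speedup(15.50, 12.33) < 1.26,
              "matches the reported 1.26x speedup");
```

The checks run at compile time, so simply compiling the snippet confirms the arithmetic.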

Let's measure the execution time:
time ./Program/GrayScottReaction/Vectorized/vectorized_gray_scott_3x3 -r 1080 -c 1920 -n 5 -e 6800
simulateImage : nbImage = 5, nbRow = 1080, nbCol = 1920
[========================================================================================================================================================|100%] 0s
Done

real 0m39,477s
user 0m39,378s
sys 0m0,052s



Maqao's expert report on the G++ 11 build
maqao cqa --fct-loops=grayscott_propagation conf=expert ./GrayScottCompute/Vectorized/libgray_scott_vectorized_3x3.so 
Target processor is: 12th generation Intel Core processors and Intel Xeon processor product family based on Alder Lake microarchitecture (x86_64 architecture).

Section 1: Function: grayscott_propagation(float*, float*, float const*, float const*, long, long, float const*, float, float, float, float, float)
===================================================================================================================================================

Code for this function has been specialized for Alder Lake, Raptor Lake. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).
These loops are supposed to be defined in: XXX/Examples/GrayScottCompute/Vectorized/vectorized_propagation_3x3.cpp

Section 1.1: Source loop ending at line 95
==========================================

Composition and unrolling
-------------------------
It is composed of the following loops [ID (first-last source line)]:
 - 0 (59-95)
 - 2 (59-95)
and is unrolled by 8 (including vectorization).

The following loops are considered as:
 - unrolled and/or vectorized main: 2
 - peel or tail: 0
The analysis will be displayed for the unrolled and/or vectorized loops: 2

Section 1.1.1: Binary (unrolled and/or vectorized) loop #2
==========================================================

The loop is defined in XXX/Examples/GrayScottCompute/Vectorized/vectorized_propagation_3x3.cpp:59-95.

It is main loop of related source loop which is unrolled by 8 (including vectorization).
100% of peak computational performance is used (32.00 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

General properties
------------------
nb instructions    : 75
nb uops            : 74
loop length        : 413
used x86 registers : 15
used mmx registers : 0
used xmm registers : 0
used ymm registers : 16
used zmm registers : 0
nb stack references: 13
ADD-SUB / MUL ratio: 3.40


Front-end
---------
ASSUMED MACRO FUSION
FIT IN UOP CACHE
micro-operation queue: 12.33 cycles
front end            : 12.33 cycles


Back-end
--------
       | P0    | P1    | P2    | P3    | P4   | P5    | P6   | P7   | P8   | P9   | P10  | P11
------------------------------------------------------------------------------------------------
uops   | 15.50 | 15.50 | 10.33 | 10.33 | 1.50 | 11.00 | 1.00 | 1.50 | 1.50 | 1.50 | 0.00 | 10.33
cycles | 15.50 | 15.50 | 10.33 | 10.33 | 1.50 | 11.00 | 1.00 | 1.50 | 1.50 | 1.50 | 0.00 | 10.33

Cycles executing div or sqrt instructions: NA
Longest recurrence chain latency (RecMII): 1.00


Cycles summary
--------------
Front-end : 12.33
Dispatch  : 15.50
Data deps.: 1.00
Overall L1: 15.50


Vectorization ratios
--------------------
all     : 100%
load    : 100%
store   : 100%
mul     : 100%
add-sub : 100%
fma     : 100%
div/sqrt: NA (no div/sqrt vectorizable/vectorized instructions)
other   : NA (no other vectorizable/vectorized instructions)


Vector efficiency ratios
------------------------
all     : 100%
load    : 100%
store   : 100%
mul     : 100%
add-sub : 100%
fma     : 100%
div/sqrt: NA (no div/sqrt vectorizable/vectorized instructions)
other   : NA (no other vectorizable/vectorized instructions)


Cycles and memory resources usage
---------------------------------
Assuming all data fit into the L1 cache, each iteration of the binary loop takes 15.50 cycles. At this rate:
 - 52% of peak load performance is reached (50.06 out of 96.00 bytes loaded per cycle (GB/s @ 1GHz))
 - 9% of peak store performance is reached (6.19 out of 64.00 bytes stored per cycle (GB/s @ 1GHz))


Front-end bottlenecks
---------------------
Found no such bottlenecks.

ASM code
--------
In the binary file, the address of the loop is: 15d0

Instruction                                                    | Nb FU | P0   | P1   | P2   | P3   | P4   | P5   | P6   | P7   | P8   | P9   | P10  | P11  | Latency | Recip. throughput
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
MOV 0x238(%RSP),%RDX                                           | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 1       | 0.33
VMOVUPS (%R14,%RAX,1),%YMM3                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VMOVUPS (%RDX,%RAX,1),%YMM0                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
MOV 0x2b0(%RSP),%RDX                                           | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 1       | 0.33
VMOVAPS 0xa0d(%RIP),%YMM12                                     | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VMOVUPS (%RDX,%RAX,1),%YMM4                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
MOV 0x230(%RSP),%RDX                                           | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 1       | 0.33
VSUBPS %YMM0,%YMM4,%YMM1                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMOVUPS (%RDX,%RAX,1),%YMM4                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
MOV 0x2a8(%RSP),%RDX                                           | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 1       | 0.33
VSUBPS %YMM0,%YMM4,%YMM2                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMOVUPS (%R13,%RAX,1),%YMM4                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VSUBPS %YMM0,%YMM12,%YMM12                                     | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMULPS %YMM10,%YMM2,%YMM2                                      | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VFMADD132PS %YMM11,%YMM2,%YMM1                                 | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VSUBPS %YMM0,%YMM4,%YMM2                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMOVUPS (%RBX,%RAX,1),%YMM4                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VFMADD132PS %YMM9,%YMM1,%YMM2                                  | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VSUBPS %YMM0,%YMM4,%YMM1                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMOVUPS (%R15,%RAX,1),%YMM4                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VFMADD132PS %YMM8,%YMM2,%YMM1                                  | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VSUBPS %YMM0,%YMM4,%YMM2                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMOVUPS (%R11,%RAX,1),%YMM4                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VFMADD132PS %YMM7,%YMM1,%YMM2                                  | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VSUBPS %YMM0,%YMM3,%YMM1                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMOVUPS (%R10,%RAX,1),%YMM3                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VFMADD132PS %YMM5,%YMM2,%YMM1                                  | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VSUBPS %YMM0,%YMM4,%YMM2                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMOVUPS (%R9,%RAX,1),%YMM4                                     | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VFMADD132PS %YMM13,%YMM1,%YMM2                                 | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VSUBPS %YMM0,%YMM3,%YMM1                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMOVUPS (%RDX,%RAX,1),%YMM3                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
MOV 0x280(%RSP),%RDX                                           | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 1       | 0.33
VSUBPS %YMM0,%YMM3,%YMM3                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VFMADD132PS %YMM5,%YMM2,%YMM1                                  | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VMOVUPS (%R8,%RAX,1),%YMM2                                     | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VSUBPS %YMM0,%YMM2,%YMM2                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMULPS 0x1a8(%RSP),%YMM1,%YMM1                                 | 1     | 0.50 | 0.50 | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 4       | 0.50
VMULPS %YMM10,%YMM2,%YMM2                                      | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VFMADD231PS %YMM11,%YMM3,%YMM2                                 | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VMOVUPS (%RDI,%RAX,1),%YMM3                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VSUBPS %YMM0,%YMM3,%YMM3                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VFMADD231PS %YMM9,%YMM3,%YMM2                                  | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VMOVUPS (%R12,%RAX,1),%YMM3                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VSUBPS %YMM0,%YMM3,%YMM3                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VFMADD132PS %YMM8,%YMM2,%YMM3                                  | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VMOVUPS (%RSI,%RAX,1),%YMM2                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VSUBPS %YMM0,%YMM2,%YMM2                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VFMADD132PS %YMM7,%YMM3,%YMM2                                  | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VMOVUPS (%RDX,%RAX,1),%YMM3                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
MOV 0x228(%RSP),%RDX                                           | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 1       | 0.33
VMOVAPS %YMM3,0x288(%RSP)                                      | 1     | 0    | 0    | 0    | 0    | 0.50 | 0    | 0    | 0.50 | 0.50 | 0.50 | 0    | 0    | 0-1     | 0.50
VSUBPS %YMM0,%YMM3,%YMM3                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VFMADD132PS %YMM5,%YMM2,%YMM3                                  | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VMOVUPS (%RCX,%RAX,1),%YMM2                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VSUBPS %YMM0,%YMM2,%YMM2                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VFMADD132PS %YMM13,%YMM3,%YMM2                                 | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VMOVUPS (%RDX,%RAX,1),%YMM3                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
MOV 0x248(%RSP),%RDX                                           | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 1       | 0.33
VSUBPS %YMM0,%YMM3,%YMM3                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VFMADD132PS 0x1e8(%RSP),%YMM2,%YMM3                            | 1     | 0.50 | 0.50 | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 4       | 0.50
VMULPS %YMM0,%YMM4,%YMM2                                       | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VMULPS %YMM3,%YMM14,%YMM3                                      | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VFNMADD231PS %YMM4,%YMM2,%YMM1                                 | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VFMADD132PS %YMM4,%YMM3,%YMM2                                  | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VFMADD231PS %YMM12,%YMM15,%YMM1                                | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VFNMADD231PS 0x208(%RSP),%YMM4,%YMM2                           | 1     | 0.50 | 0.50 | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 4       | 0.50
VFMADD132PS %YMM6,%YMM0,%YMM1                                  | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VFMADD132PS %YMM6,%YMM4,%YMM2                                  | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VMOVUPS %YMM1,(%RDX,%RAX,1)                                    | 1     | 0    | 0    | 0    | 0    | 0.50 | 0    | 0    | 0.50 | 0.50 | 0.50 | 0    | 0    | 0-1     | 0.50
MOV 0x240(%RSP),%RDX                                           | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 1       | 0.33
VMOVUPS %YMM2,(%RDX,%RAX,1)                                    | 1     | 0    | 0    | 0    | 0    | 0.50 | 0    | 0    | 0.50 | 0.50 | 0.50 | 0    | 0    | 0-1     | 0.50
ADD $0x20,%RAX                                                 | 1     | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1       | 0.17
CMP %RAX,0x1e0(%RSP)                                           | 1     | 0.20 | 0.20 | 0.33 | 0.33 | 0    | 0.20 | 0.20 | 0    | 0    | 0    | 0.20 | 0.33 | 1       | 0.33
JNE 15d0 <_Z21grayscott_propagationPfS_PKfS1_llS1_fffff+0x460> | 1     | 0.50 | 0    | 0    | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0       | 0.50



All innermost loops were analyzed.
In short, it tells us that we are at 100% of peak performance.
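The 100% figure can be cross-checked against the ASM listing. The sketch below counts the floating-point instructions of the loop body (20 FMA, 17 VSUBPS and 5 VMULPS, counted by hand from the listing) and assumes that CQA counts 2 FLOP per FMA lane and 1 FLOP per add, subtract or multiply lane. The struct and function names are hypothetical.

```cpp
// Hypothetical helper: rebuild the CQA headline figures from instruction counts.
struct LoopMix {
    int fma;     // VFMADD* / VFNMADD* instructions in the loop body
    int addsub;  // VSUBPS (and VADDPS) instructions
    int mul;     // VMULPS instructions
    int lanes;   // floats per vector register (8 for a ymm register)
};

// FLOP per loop iteration, counting an FMA as 2 FLOP per lane (assumption).
constexpr int flop_per_iteration(LoopMix m) {
    return (2 * m.fma + m.addsub + m.mul) * m.lanes;
}

// FLOP per cycle for a loop iteration that costs `cycles` cycles.
constexpr double flop_per_cycle(LoopMix m, double cycles) {
    return flop_per_iteration(m) / cycles;
}

// ADD-SUB / MUL ratio as listed under "General properties".
constexpr double addsub_mul_ratio(LoopMix m) {
    return static_cast<double>(m.addsub) / m.mul;
}

// Counts taken by hand from the ASM listing of the expert report.
constexpr LoopMix v1{20, 17, 5, 8};

static_assert(v1.fma * v1.lanes == 160, "160 FMA operations, as reported");
static_assert(flop_per_iteration(v1) == 496, "FLOP per iteration");
static_assert(flop_per_cycle(v1, 15.50) == 32.0, "32.00 FLOP per cycle");
static_assert(addsub_mul_ratio(v1) > 3.39 && addsub_mul_ratio(v1) < 3.41,
              "ADD-SUB / MUL ratio of 3.40");
```

Under these assumptions, 496 FLOP in 15.50 cycles gives exactly the 32.00 FLOP per cycle that the report equates with 100% of peak.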

Let's write a V2:
maqao cqa --fct-loops=grayscott_propagation ./GrayScottCompute/Vectorized/libgray_scott_vectorized_3x3_v2.so 
Target processor is: 12th generation Intel Core processors and Intel Xeon processor product family based on Alder Lake microarchitecture (x86_64 architecture).

Section 1: Function: grayscott_propagation(float*, float*, float const*, float const*, long, long, float const*, float, float, float, float, float)
===================================================================================================================================================

Code for this function has been specialized for Alder Lake, Raptor Lake. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).
These loops are supposed to be defined in: XXX/Examples/GrayScottCompute/Vectorized/vectorized_propagation_3x3_v2.cpp

Section 1.1: Source loop ending at line 116
===========================================

Composition and unrolling
-------------------------
It is composed of the following loops [ID (first-last source line)]:
 - 0 (59-116)
 - 2 (59-116)
and is unrolled by 8 (including vectorization).

The following loops are considered as:
 - unrolled and/or vectorized main: 2
 - peel or tail: 0
The analysis will be displayed for the unrolled and/or vectorized loops: 2

Section 1.1.1: Binary (unrolled and/or vectorized) loop #2
==========================================================

The loop is defined in XXX/Examples/GrayScottCompute/Vectorized/vectorized_propagation_3x3_v2.cpp:59-116.

It is main loop of related source loop which is unrolled by 8 (including vectorization).
88% of peak computational performance is used (28.34 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Vectorization
-------------
Your loop is fully vectorized, using full register length.

All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).


Execution units bottlenecks
---------------------------
Performance is limited by:
 - execution of FP add operations (the FP add unit is a bottleneck)
 - execution of FP multiply or FMA (fused multiply-add) operations (the FP multiply/FMA unit is a bottleneck)

Workaround(s):
 - Reduce the number of FP add instructions
 - Reduce the number of FP multiply/FMA instructions


FMA
---
Detected 112 FMA (fused multiply-add) operations.
Presence of both ADD/SUB and MUL operations.
Workaround(s):
Try to change order in which elements are evaluated (using parentheses) in arithmetic expressions containing both ADD/SUB and MUL operations to enable your compiler to generate FMA instructions wherever possible.
For instance a + b*c is a valid FMA (MUL then ADD).
However (a+b)* c cannot be translated into an FMA (ADD then MUL).


All innermost loops were analyzed.

Info: Rerun CQA with conf=hint,expert to display more advanced reports or conf=all to display them with default reports.


According to Maqao, this version is worse than V1 (88% of peak instead of 100%), which is just as well because it is not very readable.

And Maqao is right, because the execution time is longer:
time ./Program/GrayScottReaction/Vectorized/vectorized_gray_scott_3x3_v2 -r 1080 -c 1920 -n 5 -e 6800
simulateImage : nbImage = 5, nbRow = 1080, nbCol = 1920
[========================================================================================================================================================|100%] 0s
Done

real 1m8,904s
user 1m8,796s
sys 0m0,060s


The execution time is about 30 seconds longer than before (1m8.9s versus 39.5s), even though we expressed fewer FMAs (112 versus 160 previously). Clearly there is something to be gained here, but we pushed it too far.
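As a reminder of what the FMA section of the report is counting, here is a minimal sketch of the two expression shapes it mentions. It uses `std::fma` from `<cmath>` to make the fused form explicit; the function names are hypothetical, and the conversion from FMA operations to vector instructions assumes that every FMA works on 8 floats (one ymm register).

```cpp
#include <cmath>

// a + b*c: multiply then add, expressible as a single fused multiply-add.
inline float mul_then_add(float a, float b, float c) {
    return std::fma(b, c, a);  // computes b*c + a with a single rounding
}

// (a + b)*c: add then multiply, which cannot be fused into one FMA.
inline float add_then_mul(float a, float b, float c) {
    return (a + b) * c;
}

// Number of vector FMA instructions behind a CQA "FMA operations" count,
// assuming each instruction processes `lanes` floats (8 for ymm registers).
constexpr int fma_instructions(int fma_operations, int lanes) {
    return fma_operations / lanes;
}

static_assert(fma_instructions(160, 8) == 20, "first version");
static_assert(fma_instructions(112, 8) == 14, "V2");
```

Under that assumption, going from 160 to 112 FMA operations means 6 fewer vector FMA instructions per iteration, whose work then has to be done by separate multiply and add instructions.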

Let's write a version 3 and ask Maqao:

maqao cqa --fct-loops=grayscott_propagation ./GrayScottCompute/Vectorized/libgray_scott_vectorized_3x3_v3.so 
Target processor is: 12th generation Intel Core processors and Intel Xeon processor product family based on Alder Lake microarchitecture (x86_64 architecture).

Section 1: Function: grayscott_propagation(float*, float*, float const*, float const*, long, long, float const*, float, float, float, float, float)
===================================================================================================================================================

Code for this function has been specialized for Alder Lake, Raptor Lake. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).
These loops are supposed to be defined in: XXX/Examples/GrayScottCompute/Vectorized/vectorized_propagation_3x3_v3.cpp

Section 1.1: Source loop ending at line 112
===========================================

Composition and unrolling
-------------------------
It is composed of the following loops [ID (first-last source line)]:
 - 0 (59-112)
 - 2 (59-112)
and is unrolled by 8 (including vectorization).

The following loops are considered as:
 - unrolled and/or vectorized main: 2
 - peel or tail: 0
The analysis will be displayed for the unrolled and/or vectorized loops: 2

Section 1.1.1: Binary (unrolled and/or vectorized) loop #2
==========================================================

The loop is defined in XXX/Examples/GrayScottCompute/Vectorized/vectorized_propagation_3x3_v3.cpp:59-112.

It is main loop of related source loop which is unrolled by 8 (including vectorization).
91% of peak computational performance is used (29.18 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Vectorization
-------------
Your loop is fully vectorized, using full register length.

All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).


Execution units bottlenecks
---------------------------
Performance is limited by:
 - execution of FP add operations (the FP add unit is a bottleneck)
 - execution of FP multiply or FMA (fused multiply-add) operations (the FP multiply/FMA unit is a bottleneck)

By removing all these bottlenecks, you can lower the cost of an iteration from 17.00 to 14.00 cycles (1.21x speedup).

Workaround(s):
 - Reduce the number of FP add instructions
 - Reduce the number of FP multiply/FMA instructions


FMA
---
Detected 112 FMA (fused multiply-add) operations.
Presence of both ADD/SUB and MUL operations.
Workaround(s):
Try to change order in which elements are evaluated (using parentheses) in arithmetic expressions containing both ADD/SUB and MUL operations to enable your compiler to generate FMA instructions wherever possible.
For instance a + b*c is a valid FMA (MUL then ADD).
However (a+b)* c cannot be translated into an FMA (ADD then MUL).


All innermost loops were analyzed.

Info: Rerun CQA with conf=hint,expert to display more advanced reports or conf=all to display them with default reports.


Since the report is essentially unchanged from V2, we should not expect better performance:

time ./Program/GrayScottReaction/Vectorized/vectorized_gray_scott_3x3_v3 -r 1080 -c 1920 -n 5 -e 6800
simulateImage : nbImage = 5, nbRow = 1080, nbCol = 1920
[========================================================================================================================================================|100%] 0s
Done

real 1m10,070s
user 1m9,963s
sys 0m0,056s


Indeed, it is no better, even though Maqao tells us that we are at 91% of peak performance rather than 88% as in V2.
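To put these numbers side by side, the sketch below compares the measured `real` times of the three versions with the slowdown that the CQA figures alone would predict. It assumes that every version performs the same number of FLOP per iteration, so that the ratio of FLOP per cycle gives the relative loop cost; this is an assumption, not something the reports state. The names are hypothetical.

```cpp
// Hypothetical helpers comparing CQA's static estimate with the measured times.

// Convert a `time` reading such as 1m8,904s into seconds.
constexpr double to_seconds(int minutes, double seconds) {
    return 60.0 * minutes + seconds;
}

// Relative loop cost implied by CQA, assuming the same FLOP count per
// iteration in both versions (an assumption, not a reported fact).
constexpr double predicted_slowdown(double ref_flop_per_cycle, double flop_per_cycle) {
    return ref_flop_per_cycle / flop_per_cycle;
}

constexpr double v1_time = to_seconds(0, 39.477);
constexpr double v2_time = to_seconds(1, 8.904);
constexpr double v3_time = to_seconds(1, 10.070);

// Measured: V2 and V3 take between 1.74x and 1.78x the time of the first version.
static_assert(v2_time / v1_time > 1.74 && v2_time / v1_time < 1.75, "V2 measured");
static_assert(v3_time / v1_time > 1.77 && v3_time / v1_time < 1.78, "V3 measured");

// Static estimate from the reports: below 1.14x for V2 and below 1.10x for V3.
static_assert(predicted_slowdown(32.00, 28.34) < 1.14, "V2 predicted");
static_assert(predicted_slowdown(32.00, 29.18) < 1.10, "V3 predicted");
```

The measured slowdown is much larger than the static estimate, which suggests that part of the extra time comes from effects outside CQA's model, which assumes that all data fit into the L1 cache.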


Maqao's expert report
maqao cqa --fct-loops=grayscott_propagation conf=expert ./GrayScottCompute/Vectorized/libgray_scott_vectorized_3x3_v3.so 
Target processor is: 12th generation Intel Core processors and Intel Xeon processor product family based on Alder Lake microarchitecture (x86_64 architecture).

Section 1: Function: grayscott_propagation(float*, float*, float const*, float const*, long, long, float const*, float, float, float, float, float)
===================================================================================================================================================

Code for this function has been specialized for Alder Lake, Raptor Lake. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).
These loops are supposed to be defined in: XXX/Examples/GrayScottCompute/Vectorized/vectorized_propagation_3x3_v3.cpp

Section 1.1: Source loop ending at line 112
===========================================

Composition and unrolling
-------------------------
It is composed of the following loops [ID (first-last source line)]:
 - 0 (59-112)
 - 2 (59-112)
and is unrolled by 8 (including vectorization).

The following loops are considered as:
 - unrolled and/or vectorized main: 2
 - peel or tail: 0
The analysis will be displayed for the unrolled and/or vectorized loops: 2

Section 1.1.1: Binary (unrolled and/or vectorized) loop #2
==========================================================

The loop is defined in XXX/Examples/GrayScottCompute/Vectorized/vectorized_propagation_3x3_v3.cpp:59-112.

It is main loop of related source loop which is unrolled by 8 (including vectorization).
91% of peak computational performance is used (29.18 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

General properties
------------------
nb instructions    : 83
nb uops            : 82
loop length        : 465
used x86 registers : 15
used mmx registers : 0
used xmm registers : 0
used ymm registers : 16
used zmm registers : 0
nb stack references: 16
ADD-SUB / MUL ratio: 2.09


Front-end
---------
ASSUMED MACRO FUSION
FIT IN UOP CACHE
micro-operation queue: 13.67 cycles
front end            : 13.67 cycles


Back-end
--------
       | P0    | P1    | P2    | P3    | P4   | P5    | P6   | P7   | P8   | P9   | P10  | P11
------------------------------------------------------------------------------------------------
uops   | 17.00 | 17.00 | 11.33 | 11.33 | 1.50 | 14.00 | 1.00 | 1.50 | 1.50 | 1.50 | 0.00 | 11.33
cycles | 17.00 | 17.00 | 11.33 | 11.33 | 1.50 | 14.00 | 1.00 | 1.50 | 1.50 | 1.50 | 0.00 | 11.33

Cycles executing div or sqrt instructions: NA
Longest recurrence chain latency (RecMII): 1.00


Cycles summary
--------------
Front-end : 13.67
Dispatch  : 17.00
Data deps.: 1.00
Overall L1: 17.00


Vectorization ratios
--------------------
all     : 100%
load    : 100%
store   : 100%
mul     : 100%
add-sub : 100%
fma     : 100%
div/sqrt: NA (no div/sqrt vectorizable/vectorized instructions)
other   : NA (no other vectorizable/vectorized instructions)


Vector efficiency ratios
------------------------
all     : 100%
load    : 100%
store   : 100%
mul     : 100%
add-sub : 100%
fma     : 100%
div/sqrt: NA (no div/sqrt vectorizable/vectorized instructions)
other   : NA (no other vectorizable/vectorized instructions)


Cycles and memory resources usage
---------------------------------
Assuming all data fit into the L1 cache, each iteration of the binary loop takes 17.00 cycles. At this rate:
 - 53% of peak load performance is reached (51.29 out of 96.00 bytes loaded per cycle (GB/s @ 1GHz))
 - 8% of peak store performance is reached (5.65 out of 64.00 bytes stored per cycle (GB/s @ 1GHz))


Front-end bottlenecks
---------------------
Found no such bottlenecks.

ASM code
--------
In the binary file, the address of the loop is: 1580

Instruction                                                    | Nb FU | P0   | P1   | P2   | P3   | P4   | P5   | P6   | P7   | P8   | P9   | P10  | P11  | Latency | Recip. throughput
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
MOV 0x250(%RSP),%R11                                           | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 1       | 0.33
VMOVUPS (%RBX,%RAX,1),%YMM7                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VMOVUPS (%R11,%RAX,1),%YMM2                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
MOV 0x248(%RSP),%R11                                           | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 1       | 0.33
VMOVAPS 0x1a8(%RSP),%YMM8                                      | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VMOVUPS (%R11,%RAX,1),%YMM1                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
MOV 0x238(%RSP),%R11                                           | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 1       | 0.33
VSUBPS %YMM1,%YMM7,%YMM0                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMOVUPS (%R11,%RAX,1),%YMM6                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
MOV 0x2b0(%RSP),%R11                                           | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 1       | 0.33
VSUBPS %YMM1,%YMM6,%YMM4                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMOVUPS (%R11,%RAX,1),%YMM7                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
MOV 0x240(%RSP),%R11                                           | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 1       | 0.33
VSUBPS %YMM2,%YMM7,%YMM3                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMOVUPS (%R11,%RAX,1),%YMM7                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VMOVAPS 0x188(%RSP),%YMM6                                      | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VSUBPS %YMM2,%YMM7,%YMM5                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMULPS %YMM6,%YMM4,%YMM4                                       | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
MOV 0x230(%RSP),%R11                                           | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 1       | 0.33
VMULPS %YMM6,%YMM5,%YMM5                                       | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VMOVUPS (%R10,%RAX,1),%YMM6                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VMOVUPS (%RCX,%RAX,1),%YMM9                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VFMADD132PS %YMM10,%YMM4,%YMM0                                 | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VMOVUPS (%R11,%RAX,1),%YMM4                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VFMADD132PS %YMM10,%YMM5,%YMM3                                 | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VSUBPS %YMM1,%YMM6,%YMM5                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMOVUPS (%R15,%RAX,1),%YMM6                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VSUBPS %YMM2,%YMM4,%YMM7                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VSUBPS %YMM1,%YMM6,%YMM6                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMOVAPS %YMM4,0x288(%RSP)                                      | 1     | 0    | 0    | 0    | 0    | 0.50 | 0    | 0    | 0.50 | 0.50 | 0.50 | 0    | 0    | 0-1     | 0.50
VMOVUPS (%R14,%RAX,1),%YMM4                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VMULPS %YMM15,%YMM6,%YMM6                                      | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VSUBPS %YMM2,%YMM4,%YMM4                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VSUBPS %YMM2,%YMM9,%YMM9                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMULPS %YMM15,%YMM4,%YMM4                                      | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VFMADD132PS %YMM8,%YMM6,%YMM5                                  | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VMOVUPS (%R8,%RAX,1),%YMM6                                     | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VMULPS %YMM11,%YMM9,%YMM9                                      | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VFMADD231PS %YMM8,%YMM7,%YMM4                                  | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VSUBPS %YMM1,%YMM6,%YMM7                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMOVUPS (%R13,%RAX,1),%YMM6                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VADDPS %YMM0,%YMM5,%YMM5                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VSUBPS %YMM1,%YMM6,%YMM0                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VADDPS %YMM3,%YMM4,%YMM4                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMOVUPS (%R12,%RAX,1),%YMM3                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VMULPS %YMM11,%YMM0,%YMM0                                      | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VSUBPS %YMM2,%YMM3,%YMM3                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMOVUPS (%R9,%RAX,1),%YMM6                                     | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VMULPS %YMM11,%YMM3,%YMM3                                      | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VFMADD132PS %YMM14,%YMM0,%YMM7                                 | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VMOVUPS (%RDX,%RAX,1),%YMM0                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VSUBPS %YMM2,%YMM6,%YMM6                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VSUBPS %YMM1,%YMM0,%YMM8                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMOVUPS (%RDI,%RAX,1),%YMM0                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VFMADD132PS %YMM14,%YMM3,%YMM6                                 | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VMULPS %YMM11,%YMM8,%YMM8                                      | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VMOVUPS (%RSI,%RAX,1),%YMM3                                    | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VSUBPS %YMM2,%YMM0,%YMM0                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VSUBPS %YMM1,%YMM3,%YMM3                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VFMADD132PS %YMM13,%YMM9,%YMM0                                 | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VFMADD132PS %YMM13,%YMM8,%YMM3                                 | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VADDPS %YMM6,%YMM0,%YMM0                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VADDPS %YMM7,%YMM3,%YMM3                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMOVAPS 0x927(%RIP),%YMM7                                      | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 0-1     | 0.33
VADDPS %YMM4,%YMM0,%YMM0                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VADDPS %YMM5,%YMM3,%YMM3                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VMULPS %YMM2,%YMM1,%YMM5                                       | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VMULPS 0x168(%RSP),%YMM0,%YMM0                                 | 1     | 0.50 | 0.50 | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 4       | 0.50
VMULPS 0x1e8(%RSP),%YMM3,%YMM3                                 | 1     | 0.50 | 0.50 | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 4       | 0.50
VSUBPS %YMM2,%YMM7,%YMM4                                       | 1     | 0    | 0.50 | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 3       | 0.50
VFNMADD231PS %YMM1,%YMM5,%YMM0                                 | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VFMADD132PS %YMM1,%YMM3,%YMM5                                  | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VFMADD231PS 0x1c8(%RSP),%YMM4,%YMM0                            | 1     | 0.50 | 0.50 | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 4       | 0.50
VFNMADD231PS 0x208(%RSP),%YMM1,%YMM5                           | 1     | 0.50 | 0.50 | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 4       | 0.50
MOV 0x260(%RSP),%R11                                           | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 1       | 0.33
VFMADD132PS %YMM12,%YMM2,%YMM0                                 | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VFMADD132PS %YMM12,%YMM1,%YMM5                                 | 1     | 0.50 | 0.50 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 4       | 0.50
VMOVUPS %YMM0,(%R11,%RAX,1)                                    | 1     | 0    | 0    | 0    | 0    | 0.50 | 0    | 0    | 0.50 | 0.50 | 0.50 | 0    | 0    | 0-1     | 0.50
MOV 0x258(%RSP),%R11                                           | 1     | 0    | 0    | 0.33 | 0.33 | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.33 | 1       | 0.33
VMOVUPS %YMM5,(%R11,%RAX,1)                                    | 1     | 0    | 0    | 0    | 0    | 0.50 | 0    | 0    | 0.50 | 0.50 | 0.50 | 0    | 0    | 0-1     | 0.50
ADD $0x20,%RAX                                                 | 1     | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 1       | 0.17
CMP %RAX,0x228(%RSP)                                           | 1     | 0.20 | 0.20 | 0.33 | 0.33 | 0    | 0.20 | 0.20 | 0    | 0    | 0    | 0.20 | 0.33 | 1       | 0.33
JNE 1580 <_Z21grayscott_propagationPfS_PKfS1_llS1_fffff+0x410> | 1     | 0.50 | 0    | 0    | 0    | 0    | 0    | 0.50 | 0    | 0    | 0    | 0    | 0    | 0       | 0.50



All innermost loops were analyzed.
And here, MAQAO tells us that we are at 91% of peak computational performance (29.18 out of 32.00 FLOP per cycle).
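To see where these figures come from, the report's numbers can be recomputed by hand from the "ASM code" table. The sketch below is a minimal check, not MAQAO's documented formula: it assumes that each packed add, sub or mul counts for 8 FLOP (one per float lane of a YMM register), that each FMA counts for 16, and that one iteration costs the 17 cycles given by "Overall L1". The instruction counts are tallied from the listing above, and all constant and function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Instruction counts tallied from the "ASM code" table of the report above.
constexpr int kAddSub = 23;      // 17 VSUBPS + 6 VADDPS
constexpr int kMul = 11;         // VMULPS
constexpr int kFma = 14;         // VFMADD*PS and VFNMADD*PS
constexpr int kYmmLoads = 25;    // 21 VMOVUPS/VMOVAPS loads + 4 memory operands
constexpr int kScalarLoads = 9;  // 8 MOV from the stack + 1 CMP memory operand
constexpr int kYmmStores = 3;    // VMOVAPS/VMOVUPS to memory

constexpr int kLanes = 8;                   // floats per 256-bit YMM register
constexpr double kCyclesPerIter = 17.0;     // "Overall L1" in the cycles summary
constexpr double kPeakFlopPerCycle = 32.0;  // peak quoted by the report

// Assumption: add/sub/mul = 1 FLOP per lane, FMA = 2 FLOP per lane.
constexpr int flop_per_iteration() {
    return (kAddSub + kMul) * kLanes + kFma * 2 * kLanes;
}
constexpr double flop_per_cycle() {
    return flop_per_iteration() / kCyclesPerIter;
}
constexpr double peak_fraction() {
    return flop_per_cycle() / kPeakFlopPerCycle;
}

// A YMM access moves 32 bytes, a 64-bit scalar access moves 8 bytes.
constexpr int load_bytes_per_iteration() {
    return kYmmLoads * 32 + kScalarLoads * 8;
}
constexpr double load_bytes_per_cycle() {
    return load_bytes_per_iteration() / kCyclesPerIter;
}
constexpr double store_bytes_per_cycle() {
    return kYmmStores * 32 / kCyclesPerIter;
}

// The ADD-SUB / MUL ratio of the report is the plain quotient of the counts.
constexpr double add_sub_mul_ratio() {
    return static_cast<double>(kAddSub) / kMul;
}

// Compares the recomputed values with the ones printed in the report.
inline void check_against_report() {
    assert(std::fabs(flop_per_cycle() - 29.18) < 0.01);
    assert(std::fabs(load_bytes_per_cycle() - 51.29) < 0.01);
    assert(std::fabs(store_bytes_per_cycle() - 5.65) < 0.01);
    assert(std::fabs(add_sub_mul_ratio() - 2.09) < 0.01);
}
```

Under these assumptions one iteration performs 496 FLOP, which over 17 cycles gives the 29.18 FLOP per cycle of the report, that is 91% of the 32 FLOP per cycle peak. The 17 cycles are themselves set by ports P0 and P1, which each execute 17 uops per iteration, while the front end would only need 13.67 cycles.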

The moral: always leave the compiler some freedom so that it can optimize the computation properly.