7.3.1 : Maqao's opinion
Let's ask maqao for its opinion:
maqao cqa --fct-loops=grayscott_propagation ./GrayScottCompute/Vectorized/libgray_scott_vectorized_3x3.so
Target processor is: 12th generation Intel Core processors and Intel Xeon processor product family based on Alder Lake microarchitecture (x86_64 architecture).

Section 1: Function: grayscott_propagation(float*, float*, float const*, float const*, long, long, float const*, float, float, float, float, float)
===================================================================================================================================================

Code for this function has been specialized for Alder Lake, Raptor Lake. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).
These loops are supposed to be defined in: XXX/Examples/GrayScottCompute/Vectorized/vectorized_propagation_3x3.cpp

Section 1.1: Source loop ending at line 95
==========================================

Composition and unrolling
-------------------------
It is composed of the following loops [ID (first-last source line)]:
 - 0 (59-95)
 - 2 (59-95)
and is unrolled by 8 (including vectorization).

The following loops are considered as:
 - unrolled and/or vectorized main: 2
 - peel or tail: 0
The analysis will be displayed for the unrolled and/or vectorized loops: 2

Section 1.1.1: Binary (unrolled and/or vectorized) loop #2
==========================================================

The loop is defined in XXX/Examples/GrayScottCompute/Vectorized/vectorized_propagation_3x3.cpp:59-95.

It is main loop of related source loop which is unrolled by 8 (including vectorization).

100% of peak computational performance is used (32.00 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Vectorization
-------------
Your loop is fully vectorized, using full register length.

All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).

Execution units bottlenecks
---------------------------
Performance is limited by:
 - execution of FP add operations (the FP add unit is a bottleneck)
 - execution of FP multiply or FMA (fused multiply-add) operations (the FP multiply/FMA unit is a bottleneck)

By removing all these bottlenecks, you can lower the cost of an iteration from 15.50 to 12.33 cycles (1.26x speedup).

Workaround(s):
 - Reduce the number of FP add instructions
 - Reduce the number of FP multiply/FMA instructions

FMA
---
Detected 160 FMA (fused multiply-add) operations.

Presence of both ADD/SUB and MUL operations.

Workaround(s):
Try to change order in which elements are evaluated (using parentheses) in arithmetic expressions containing both ADD/SUB and MUL operations to enable your compiler to generate FMA instructions wherever possible.
For instance a + b*c is a valid FMA (MUL then ADD).
However (a+b)* c cannot be translated into an FMA (ADD then MUL).

All innermost loops were analyzed.

Info: Rerun CQA with conf=hint,expert to display more advanced reports or conf=all to display them with default reports.
Clearly, the compiler has done its job correctly: the loop is fully vectorized and Maqao reports 100% of peak computational performance.
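To make Maqao's FMA remark concrete, here is a small sketch of the two expression shapes it mentions. The function names are ours (hypothetical, not taken from the Gray-Scott sources), and `std::fma` is used only to make the fused operation explicit; with optimizations enabled and a suitable `-march`, the compiler may well generate the FMA on its own from `a + b*c`.

```cpp
#include <cmath>

// a + b*c : MUL then ADD, this shape can map to a single fused multiply-add.
inline float mulThenAdd(float a, float b, float c){
    return std::fma(b, c, a);
}

// (a + b)*c : ADD then MUL, this shape cannot be fused and needs two instructions.
inline float addThenMul(float a, float b, float c){
    return (a + b)*c;
}

// Same value up to rounding, rewritten as a*c + b*c : one MUL feeding one FMA.
inline float addThenMulDistributed(float a, float b, float c){
    return std::fma(b, c, a*c);
}
```

Note that the distributed form still costs two instructions (one MUL and one FMA instead of one ADD and one MUL) and may round differently from `(a + b)*c`, so it is a trade to measure rather than a guaranteed gain.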
Let's test the execution time:
time ./Program/GrayScottReaction/Vectorized/vectorized_gray_scott_3x3 -r 1080 -c 1920 -n 5 -e 6800 simulateImage : nbImage = 5, nbRow = 1080, nbCol = 1920 [========================================================================================================================================================|100%] 0s Done
real 0m39,477s user 0m39,378s sys 0m0,052s
Maqao's expert report on the G++ 11 build
maqao cqa --fct-loops=grayscott_propagation conf=expert ./GrayScottCompute/Vectorized/libgray_scott_vectorized_3x3.so
Target processor is: 12th generation Intel Core processors and Intel Xeon processor product family based on Alder Lake microarchitecture (x86_64 architecture).

Section 1: Function: grayscott_propagation(float*, float*, float const*, float const*, long, long, float const*, float, float, float, float, float)
===================================================================================================================================================

Code for this function has been specialized for Alder Lake, Raptor Lake. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).
These loops are supposed to be defined in: XXX/Examples/GrayScottCompute/Vectorized/vectorized_propagation_3x3.cpp

Section 1.1: Source loop ending at line 95
==========================================

Composition and unrolling
-------------------------
It is composed of the following loops [ID (first-last source line)]:
 - 0 (59-95)
 - 2 (59-95)
and is unrolled by 8 (including vectorization).

The following loops are considered as:
 - unrolled and/or vectorized main: 2
 - peel or tail: 0
The analysis will be displayed for the unrolled and/or vectorized loops: 2

Section 1.1.1: Binary (unrolled and/or vectorized) loop #2
==========================================================

The loop is defined in XXX/Examples/GrayScottCompute/Vectorized/vectorized_propagation_3x3.cpp:59-95.

It is main loop of related source loop which is unrolled by 8 (including vectorization).

100% of peak computational performance is used (32.00 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

General properties
------------------
nb instructions    : 75
nb uops            : 74
loop length        : 413
used x86 registers : 15
used mmx registers : 0
used xmm registers : 0
used ymm registers : 16
used zmm registers : 0
nb stack references: 13
ADD-SUB / MUL ratio: 3.40

Front-end
---------
ASSUMED MACRO FUSION
FIT IN UOP CACHE
micro-operation queue: 12.33 cycles
front end            : 12.33 cycles

Back-end
--------
       | P0    | P1    | P2    | P3    | P4   | P5    | P6   | P7   | P8   | P9   | P10  | P11
------------------------------------------------------------------------------------------------
uops   | 15.50 | 15.50 | 10.33 | 10.33 | 1.50 | 11.00 | 1.00 | 1.50 | 1.50 | 1.50 | 0.00 | 10.33
cycles | 15.50 | 15.50 | 10.33 | 10.33 | 1.50 | 11.00 | 1.00 | 1.50 | 1.50 | 1.50 | 0.00 | 10.33

Cycles executing div or sqrt instructions: NA
Longest recurrence chain latency (RecMII): 1.00

Cycles summary
--------------
Front-end : 12.33
Dispatch  : 15.50
Data deps.: 1.00
Overall L1: 15.50

Vectorization ratios
--------------------
all     : 100%
load    : 100%
store   : 100%
mul     : 100%
add-sub : 100%
fma     : 100%
div/sqrt: NA (no div/sqrt vectorizable/vectorized instructions)
other   : NA (no other vectorizable/vectorized instructions)

Vector efficiency ratios
------------------------
all     : 100%
load    : 100%
store   : 100%
mul     : 100%
add-sub : 100%
fma     : 100%
div/sqrt: NA (no div/sqrt vectorizable/vectorized instructions)
other   : NA (no other vectorizable/vectorized instructions)

Cycles and memory resources usage
---------------------------------
Assuming all data fit into the L1 cache, each iteration of the binary loop takes 15.50 cycles. At this rate:
 - 52% of peak load performance is reached (50.06 out of 96.00 bytes loaded per cycle (GB/s @ 1GHz))
 - 9% of peak store performance is reached (6.19 out of 64.00 bytes stored per cycle (GB/s @ 1GHz))

Front-end bottlenecks
---------------------
Found no such bottlenecks.

ASM code -------- In the binary file, the address of the loop is: 15d0 Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | Latency | Recip. throughput ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- MOV 0x238(%RSP),%RDX | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 VMOVUPS (%R14,%RAX,1),%YMM3 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33 VMOVUPS (%RDX,%RAX,1),%YMM0 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33 MOV 0x2b0(%RSP),%RDX | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 VMOVAPS 0xa0d(%RIP),%YMM12 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33 VMOVUPS (%RDX,%RAX,1),%YMM4 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33 MOV 0x230(%RSP),%RDX | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 VSUBPS %YMM0,%YMM4,%YMM1 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 VMOVUPS (%RDX,%RAX,1),%YMM4 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33 MOV 0x2a8(%RSP),%RDX | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 VSUBPS %YMM0,%YMM4,%YMM2 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 VMOVUPS (%R13,%RAX,1),%YMM4 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33 VSUBPS %YMM0,%YMM12,%YMM12 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 VMULPS %YMM10,%YMM2,%YMM2 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VFMADD132PS %YMM11,%YMM2,%YMM1 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VSUBPS %YMM0,%YMM4,%YMM2 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 VMOVUPS (%RBX,%RAX,1),%YMM4 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 
0 | 0 | 0.33 | 0-1 | 0.33 VFMADD132PS %YMM9,%YMM1,%YMM2 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VSUBPS %YMM0,%YMM4,%YMM1 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 VMOVUPS (%R15,%RAX,1),%YMM4 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33 VFMADD132PS %YMM8,%YMM2,%YMM1 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VSUBPS %YMM0,%YMM4,%YMM2 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 VMOVUPS (%R11,%RAX,1),%YMM4 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33 VFMADD132PS %YMM7,%YMM1,%YMM2 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VSUBPS %YMM0,%YMM3,%YMM1 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 VMOVUPS (%R10,%RAX,1),%YMM3 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33 VFMADD132PS %YMM5,%YMM2,%YMM1 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VSUBPS %YMM0,%YMM4,%YMM2 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 VMOVUPS (%R9,%RAX,1),%YMM4 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33 VFMADD132PS %YMM13,%YMM1,%YMM2 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VSUBPS %YMM0,%YMM3,%YMM1 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 VMOVUPS (%RDX,%RAX,1),%YMM3 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33 MOV 0x280(%RSP),%RDX | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 VSUBPS %YMM0,%YMM3,%YMM3 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 VFMADD132PS %YMM5,%YMM2,%YMM1 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VMOVUPS (%R8,%RAX,1),%YMM2 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33 VSUBPS %YMM0,%YMM2,%YMM2 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 VMULPS 
0x1a8(%RSP),%YMM1,%YMM1 | 1 | 0.50 | 0.50 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 4 | 0.50 VMULPS %YMM10,%YMM2,%YMM2 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VFMADD231PS %YMM11,%YMM3,%YMM2 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VMOVUPS (%RDI,%RAX,1),%YMM3 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33 VSUBPS %YMM0,%YMM3,%YMM3 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 VFMADD231PS %YMM9,%YMM3,%YMM2 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VMOVUPS (%R12,%RAX,1),%YMM3 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33 VSUBPS %YMM0,%YMM3,%YMM3 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 VFMADD132PS %YMM8,%YMM2,%YMM3 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VMOVUPS (%RSI,%RAX,1),%YMM2 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33 VSUBPS %YMM0,%YMM2,%YMM2 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 VFMADD132PS %YMM7,%YMM3,%YMM2 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VMOVUPS (%RDX,%RAX,1),%YMM3 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33 MOV 0x228(%RSP),%RDX | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 VMOVAPS %YMM3,0x288(%RSP) | 1 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0-1 | 0.50 VSUBPS %YMM0,%YMM3,%YMM3 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 VFMADD132PS %YMM5,%YMM2,%YMM3 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VMOVUPS (%RCX,%RAX,1),%YMM2 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33 VSUBPS %YMM0,%YMM2,%YMM2 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 VFMADD132PS %YMM13,%YMM3,%YMM2 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VMOVUPS (%RDX,%RAX,1),%YMM3 | 
1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33 MOV 0x248(%RSP),%RDX | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 VSUBPS %YMM0,%YMM3,%YMM3 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 VFMADD132PS 0x1e8(%RSP),%YMM2,%YMM3 | 1 | 0.50 | 0.50 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 4 | 0.50 VMULPS %YMM0,%YMM4,%YMM2 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VMULPS %YMM3,%YMM14,%YMM3 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VFNMADD231PS %YMM4,%YMM2,%YMM1 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VFMADD132PS %YMM4,%YMM3,%YMM2 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VFMADD231PS %YMM12,%YMM15,%YMM1 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VFNMADD231PS 0x208(%RSP),%YMM4,%YMM2 | 1 | 0.50 | 0.50 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 4 | 0.50 VFMADD132PS %YMM6,%YMM0,%YMM1 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VFMADD132PS %YMM6,%YMM4,%YMM2 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 VMOVUPS %YMM1,(%RDX,%RAX,1) | 1 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0-1 | 0.50 MOV 0x240(%RSP),%RDX | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33 VMOVUPS %YMM2,(%RDX,%RAX,1) | 1 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0-1 | 0.50 ADD $0x20,%RAX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17 CMP %RAX,0x1e0(%RSP) | 1 | 0.20 | 0.20 | 0.33 | 0.33 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.33 | 1 | 0.33 JNE 15d0 <_Z21grayscott_propagationPfS_PKfS1_llS1_fffff+0x460> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 All innermost loops were analyzed. |
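The "Cycles summary" of the expert report can be read as a simple maximum: the busiest execution port gives the dispatch bound, and the loop cannot go faster than the largest of the front-end, dispatch and data dependency bounds. The sketch below reproduces that reading with the numbers of the report; it is our interpretation of the output, not Maqao's actual implementation, and the names `PortPressure`, `dispatchBound` and `overallL1Bound` are hypothetical.

```cpp
#include <algorithm>
#include <array>

// Per-port uop pressure (P0 to P11), copied by hand from the "Back-end" table.
using PortPressure = std::array<double, 12>;

// Dispatch bound: the most loaded port sets the minimum number of cycles per iteration.
double dispatchBound(const PortPressure & ports){
    return *std::max_element(ports.begin(), ports.end());
}

// Overall bound when all data fit in L1: the largest of the three individual bounds.
double overallL1Bound(double frontEnd, const PortPressure & ports, double dataDeps){
    return std::max({frontEnd, dispatchBound(ports), dataDeps});
}
```

With the V1 values (P0 and P1 at 15.50, front-end at 12.33, data dependencies at 1.00), both functions give 15.50 cycles. This also explains the "from 15.50 to 12.33 cycles" hint: if the pressure on the FP ports dropped below the front-end bound, the front end would become the limit.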
Let's make a V2:
maqao cqa --fct-loops=grayscott_propagation ./GrayScottCompute/Vectorized/libgray_scott_vectorized_3x3_v2.so
Target processor is: 12th generation Intel Core processors and Intel Xeon processor product family based on Alder Lake microarchitecture (x86_64 architecture).

Section 1: Function: grayscott_propagation(float*, float*, float const*, float const*, long, long, float const*, float, float, float, float, float)
===================================================================================================================================================

Code for this function has been specialized for Alder Lake, Raptor Lake. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).
These loops are supposed to be defined in: XXX/Examples/GrayScottCompute/Vectorized/vectorized_propagation_3x3_v2.cpp

Section 1.1: Source loop ending at line 116
===========================================

Composition and unrolling
-------------------------
It is composed of the following loops [ID (first-last source line)]:
 - 0 (59-116)
 - 2 (59-116)
and is unrolled by 8 (including vectorization).

The following loops are considered as:
 - unrolled and/or vectorized main: 2
 - peel or tail: 0
The analysis will be displayed for the unrolled and/or vectorized loops: 2

Section 1.1.1: Binary (unrolled and/or vectorized) loop #2
==========================================================

The loop is defined in XXX/Examples/GrayScottCompute/Vectorized/vectorized_propagation_3x3_v2.cpp:59-116.

It is main loop of related source loop which is unrolled by 8 (including vectorization).

88% of peak computational performance is used (28.34 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Vectorization
-------------
Your loop is fully vectorized, using full register length.

All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).

Execution units bottlenecks
---------------------------
Performance is limited by:
 - execution of FP add operations (the FP add unit is a bottleneck)
 - execution of FP multiply or FMA (fused multiply-add) operations (the FP multiply/FMA unit is a bottleneck)

Workaround(s):
 - Reduce the number of FP add instructions
 - Reduce the number of FP multiply/FMA instructions

FMA
---
Detected 112 FMA (fused multiply-add) operations.

Presence of both ADD/SUB and MUL operations.

Workaround(s):
Try to change order in which elements are evaluated (using parentheses) in arithmetic expressions containing both ADD/SUB and MUL operations to enable your compiler to generate FMA instructions wherever possible.
For instance a + b*c is a valid FMA (MUL then ADD).
However (a+b)* c cannot be translated into an FMA (ADD then MUL).

All innermost loops were analyzed.

Info: Rerun CQA with conf=hint,expert to display more advanced reports or conf=all to display them with default reports.
Well, according to Maqao it is worse than V1: 88% of peak instead of 100% (just as well, because it is not very readable).
And Maqao is right, because the execution time is longer:
time ./Program/GrayScottReaction/Vectorized/vectorized_gray_scott_3x3_v2 -r 1080 -c 1920 -n 5 -e 6800 simulateImage : nbImage = 5, nbRow = 1080, nbCol = 1920 [========================================================================================================================================================|100%] 0s Done
real 1m8,904s user 1m8,796s sys 0m0,060s
The execution time is about 30 seconds longer than the previous one (1m8.9s versus 39.5s), even though we expressed fewer FMA (112 versus 160 previously). Clearly there is something to gain here, but we pushed it too far.
Let's make a version 3 and ask maqao:
maqao cqa --fct-loops=grayscott_propagation ./GrayScottCompute/Vectorized/libgray_scott_vectorized_3x3_v3.so
Target processor is: 12th generation Intel Core processors and Intel Xeon processor product family based on Alder Lake microarchitecture (x86_64 architecture).

Section 1: Function: grayscott_propagation(float*, float*, float const*, float const*, long, long, float const*, float, float, float, float, float)
===================================================================================================================================================

Code for this function has been specialized for Alder Lake, Raptor Lake. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).
These loops are supposed to be defined in: XXX/Examples/GrayScottCompute/Vectorized/vectorized_propagation_3x3_v3.cpp

Section 1.1: Source loop ending at line 112
===========================================

Composition and unrolling
-------------------------
It is composed of the following loops [ID (first-last source line)]:
 - 0 (59-112)
 - 2 (59-112)
and is unrolled by 8 (including vectorization).

The following loops are considered as:
 - unrolled and/or vectorized main: 2
 - peel or tail: 0
The analysis will be displayed for the unrolled and/or vectorized loops: 2

Section 1.1.1: Binary (unrolled and/or vectorized) loop #2
==========================================================

The loop is defined in XXX/Examples/GrayScottCompute/Vectorized/vectorized_propagation_3x3_v3.cpp:59-112.

It is main loop of related source loop which is unrolled by 8 (including vectorization).

91% of peak computational performance is used (29.18 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Vectorization
-------------
Your loop is fully vectorized, using full register length.

All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).

Execution units bottlenecks
---------------------------
Performance is limited by:
 - execution of FP add operations (the FP add unit is a bottleneck)
 - execution of FP multiply or FMA (fused multiply-add) operations (the FP multiply/FMA unit is a bottleneck)

By removing all these bottlenecks, you can lower the cost of an iteration from 17.00 to 14.00 cycles (1.21x speedup).

Workaround(s):
 - Reduce the number of FP add instructions
 - Reduce the number of FP multiply/FMA instructions

FMA
---
Detected 112 FMA (fused multiply-add) operations.

Presence of both ADD/SUB and MUL operations.

Workaround(s):
Try to change order in which elements are evaluated (using parentheses) in arithmetic expressions containing both ADD/SUB and MUL operations to enable your compiler to generate FMA instructions wherever possible.
For instance a + b*c is a valid FMA (MUL then ADD).
However (a+b)* c cannot be translated into an FMA (ADD then MUL).

All innermost loops were analyzed.

Info: Rerun CQA with conf=hint,expert to display more advanced reports or conf=all to display them with default reports.
Since the report is essentially unchanged (still 112 FMA and the same two bottlenecks), we should not expect better performance:
time ./Program/GrayScottReaction/Vectorized/vectorized_gray_scott_3x3_v3 -r 1080 -c 1920 -n 5 -e 6800 simulateImage : nbImage = 5, nbRow = 1080, nbCol = 1920 [========================================================================================================================================================|100%] 0s Done
real 1m10,070s user 1m9,963s sys 0m0,056s
Indeed, it is no better, even though maqao tells us that we are now at 91% of peak performance rather than 88% as in V2.
Maqao's expert report
maqao cqa --fct-loops=grayscott_propagation conf=expert ./GrayScottCompute/Vectorized/libgray_scott_vectorized_3x3_v3.so
Target processor is: 12th generation Intel Core processors and Intel Xeon processor product family based on Alder Lake microarchitecture (x86_64 architecture).

Section 1: Function: grayscott_propagation(float*, float*, float const*, float const*, long, long, float const*, float, float, float, float, float)
===================================================================================================================================================

Code for this function has been specialized for Alder Lake, Raptor Lake. For execution on another machine, recompile on it or with explicit target (example for a Haswell machine: use -march=haswell, see compiler manual for full list).
These loops are supposed to be defined in: XXX/Examples/GrayScottCompute/Vectorized/vectorized_propagation_3x3_v3.cpp

Section 1.1: Source loop ending at line 112
===========================================

Composition and unrolling
-------------------------
It is composed of the following loops [ID (first-last source line)]:
 - 0 (59-112)
 - 2 (59-112)
and is unrolled by 8 (including vectorization).

The following loops are considered as:
 - unrolled and/or vectorized main: 2
 - peel or tail: 0
The analysis will be displayed for the unrolled and/or vectorized loops: 2

Section 1.1.1: Binary (unrolled and/or vectorized) loop #2
==========================================================

The loop is defined in XXX/Examples/GrayScottCompute/Vectorized/vectorized_propagation_3x3_v3.cpp:59-112.

It is main loop of related source loop which is unrolled by 8 (including vectorization).

91% of peak computational performance is used (29.18 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

General properties
------------------
nb instructions    : 83
nb uops            : 82
loop length        : 465
used x86 registers : 15
used mmx registers : 0
used xmm registers : 0
used ymm registers : 16
used zmm registers : 0
nb stack references: 16
ADD-SUB / MUL ratio: 2.09

Front-end
---------
ASSUMED MACRO FUSION
FIT IN UOP CACHE
micro-operation queue: 13.67 cycles
front end            : 13.67 cycles

Back-end
--------
       | P0    | P1    | P2    | P3    | P4   | P5    | P6   | P7   | P8   | P9   | P10  | P11
------------------------------------------------------------------------------------------------
uops   | 17.00 | 17.00 | 11.33 | 11.33 | 1.50 | 14.00 | 1.00 | 1.50 | 1.50 | 1.50 | 0.00 | 11.33
cycles | 17.00 | 17.00 | 11.33 | 11.33 | 1.50 | 14.00 | 1.00 | 1.50 | 1.50 | 1.50 | 0.00 | 11.33

Cycles executing div or sqrt instructions: NA
Longest recurrence chain latency (RecMII): 1.00

Cycles summary
--------------
Front-end : 13.67
Dispatch  : 17.00
Data deps.: 1.00
Overall L1: 17.00

Vectorization ratios
--------------------
all     : 100%
load    : 100%
store   : 100%
mul     : 100%
add-sub : 100%
fma     : 100%
div/sqrt: NA (no div/sqrt vectorizable/vectorized instructions)
other   : NA (no other vectorizable/vectorized instructions)

Vector efficiency ratios
------------------------
all     : 100%
load    : 100%
store   : 100%
mul     : 100%
add-sub : 100%
fma     : 100%
div/sqrt: NA (no div/sqrt vectorizable/vectorized instructions)
other   : NA (no other vectorizable/vectorized instructions)

Cycles and memory resources usage
---------------------------------
Assuming all data fit into the L1 cache, each iteration of the binary loop takes 17.00 cycles. At this rate:
 - 53% of peak load performance is reached (51.29 out of 96.00 bytes loaded per cycle (GB/s @ 1GHz))
 - 8% of peak store performance is reached (5.65 out of 64.00 bytes stored per cycle (GB/s @ 1GHz))

Front-end bottlenecks
---------------------
Found no such bottlenecks.

ASM code
--------
In the binary file, the address of the loop is: 1580

Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | Latency | Recip. throughput
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
MOV 0x250(%RSP),%R11 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33
VMOVUPS (%RBX,%RAX,1),%YMM7 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
VMOVUPS (%R11,%RAX,1),%YMM2 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
MOV 0x248(%RSP),%R11 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33
VMOVAPS 0x1a8(%RSP),%YMM8 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
VMOVUPS (%R11,%RAX,1),%YMM1 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
MOV 0x238(%RSP),%R11 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33
VSUBPS %YMM1,%YMM7,%YMM0 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VMOVUPS (%R11,%RAX,1),%YMM6 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
MOV 0x2b0(%RSP),%R11 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33
VSUBPS %YMM1,%YMM6,%YMM4 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VMOVUPS (%R11,%RAX,1),%YMM7 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
MOV 0x240(%RSP),%R11 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33
VSUBPS %YMM2,%YMM7,%YMM3 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VMOVUPS (%R11,%RAX,1),%YMM7 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
VMOVAPS 0x188(%RSP),%YMM6 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
VSUBPS %YMM2,%YMM7,%YMM5 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VMULPS %YMM6,%YMM4,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
MOV 0x230(%RSP),%R11 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33
VMULPS %YMM6,%YMM5,%YMM5 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
VMOVUPS (%R10,%RAX,1),%YMM6 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
VMOVUPS (%RCX,%RAX,1),%YMM9 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
VFMADD132PS %YMM10,%YMM4,%YMM0 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
VMOVUPS (%R11,%RAX,1),%YMM4 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
VFMADD132PS %YMM10,%YMM5,%YMM3 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
VSUBPS %YMM1,%YMM6,%YMM5 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VMOVUPS (%R15,%RAX,1),%YMM6 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
VSUBPS %YMM2,%YMM4,%YMM7 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VSUBPS %YMM1,%YMM6,%YMM6 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VMOVAPS %YMM4,0x288(%RSP) | 1 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0-1 | 0.50
VMOVUPS (%R14,%RAX,1),%YMM4 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
VMULPS %YMM15,%YMM6,%YMM6 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
VSUBPS %YMM2,%YMM4,%YMM4 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VSUBPS %YMM2,%YMM9,%YMM9 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VMULPS %YMM15,%YMM4,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
VFMADD132PS %YMM8,%YMM6,%YMM5 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
VMOVUPS (%R8,%RAX,1),%YMM6 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
VMULPS %YMM11,%YMM9,%YMM9 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
VFMADD231PS %YMM8,%YMM7,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
VSUBPS %YMM1,%YMM6,%YMM7 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VMOVUPS (%R13,%RAX,1),%YMM6 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
VADDPS %YMM0,%YMM5,%YMM5 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VSUBPS %YMM1,%YMM6,%YMM0 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VADDPS %YMM3,%YMM4,%YMM4 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VMOVUPS (%R12,%RAX,1),%YMM3 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
VMULPS %YMM11,%YMM0,%YMM0 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
VSUBPS %YMM2,%YMM3,%YMM3 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VMOVUPS (%R9,%RAX,1),%YMM6 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
VMULPS %YMM11,%YMM3,%YMM3 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
VFMADD132PS %YMM14,%YMM0,%YMM7 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
VMOVUPS (%RDX,%RAX,1),%YMM0 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
VSUBPS %YMM2,%YMM6,%YMM6 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VSUBPS %YMM1,%YMM0,%YMM8 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VMOVUPS (%RDI,%RAX,1),%YMM0 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
VFMADD132PS %YMM14,%YMM3,%YMM6 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
VMULPS %YMM11,%YMM8,%YMM8 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
VMOVUPS (%RSI,%RAX,1),%YMM3 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
VSUBPS %YMM2,%YMM0,%YMM0 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VSUBPS %YMM1,%YMM3,%YMM3 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VFMADD132PS %YMM13,%YMM9,%YMM0 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
VFMADD132PS %YMM13,%YMM8,%YMM3 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
VADDPS %YMM6,%YMM0,%YMM0 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VADDPS %YMM7,%YMM3,%YMM3 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VMOVAPS 0x927(%RIP),%YMM7 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0-1 | 0.33
VADDPS %YMM4,%YMM0,%YMM0 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VADDPS %YMM5,%YMM3,%YMM3 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VMULPS %YMM2,%YMM1,%YMM5 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
VMULPS 0x168(%RSP),%YMM0,%YMM0 | 1 | 0.50 | 0.50 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 4 | 0.50
VMULPS 0x1e8(%RSP),%YMM3,%YMM3 | 1 | 0.50 | 0.50 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 4 | 0.50
VSUBPS %YMM2,%YMM7,%YMM4 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50
VFNMADD231PS %YMM1,%YMM5,%YMM0 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
VFMADD132PS %YMM1,%YMM3,%YMM5 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
VFMADD231PS 0x1c8(%RSP),%YMM4,%YMM0 | 1 | 0.50 | 0.50 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 4 | 0.50
VFNMADD231PS 0x208(%RSP),%YMM1,%YMM5 | 1 | 0.50 | 0.50 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 4 | 0.50
MOV 0x260(%RSP),%R11 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33
VFMADD132PS %YMM12,%YMM2,%YMM0 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
VFMADD132PS %YMM12,%YMM1,%YMM5 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50
VMOVUPS %YMM0,(%R11,%RAX,1) | 1 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0-1 | 0.50
MOV 0x258(%RSP),%R11 | 1 | 0 | 0 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.33 | 1 | 0.33
VMOVUPS %YMM5,(%R11,%RAX,1) | 1 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0-1 | 0.50
ADD $0x20,%RAX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.17
CMP %RAX,0x228(%RSP) | 1 | 0.20 | 0.20 | 0.33 | 0.33 | 0 | 0.20 | 0.20 | 0 | 0 | 0 | 0.20 | 0.33 | 1 | 0.33
JNE 1580 <_Z21grayscott_propagationPfS_PKfS1_llS1_fffff+0x410> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50

All innermost loops were analyzed.
The moral: always leave the compiler some freedom, so that it can optimize the computation properly.
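As a rough illustration of what leaving freedom to the compiler can look like, here is a minimal sketch of a 3x3 weighted stencil written as plain nested loops, with a contiguous and branch-free inner loop. It is not the content of `vectorized_propagation_3x3.cpp`: the function name `stencil3x3`, its signature, the row-major layout and the untouched border are all assumptions made for this example. Whether a compiler actually vectorizes it depends on the compiler, its version and the flags used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper written for this illustration (not the tutorial's kernel).
// Applies a 3x3 weighted stencil to the interior of a row-major
// nbRow x nbCol image. Border cells of `out` are left untouched.
void stencil3x3(std::vector<float>& out, const std::vector<float>& in,
                std::size_t nbRow, std::size_t nbCol,
                const float (&weight)[3][3])
{
    assert(in.size() == nbRow * nbCol && out.size() == in.size());
    for (std::size_t i = 1; i + 1 < nbRow; ++i) {
        // Row pointers are computed once per row, outside the inner loop.
        const float* up   = in.data() + (i - 1) * nbCol;
        const float* mid  = in.data() + i * nbCol;
        const float* down = in.data() + (i + 1) * nbCol;
        float* res = out.data() + i * nbCol;
        // Contiguous accesses and no branch in the body: the compiler is
        // free to choose the vector width, the unrolling and the use of FMA.
        for (std::size_t j = 1; j + 1 < nbCol; ++j) {
            res[j] = weight[0][0] * up[j - 1]   + weight[0][1] * up[j]   + weight[0][2] * up[j + 1]
                   + weight[1][0] * mid[j - 1]  + weight[1][1] * mid[j]  + weight[1][2] * mid[j + 1]
                   + weight[2][0] * down[j - 1] + weight[2][1] * down[j] + weight[2][2] * down[j + 1];
        }
    }
}
```

The sketch deliberately contains no intrinsics and no manual unrolling. Since `out` and `in` are separate vectors, a compiler may still need to insert a runtime aliasing check before using a vectorized path. Running `maqao cqa` on the resulting binary, as done above, is one way to check what was actually generated.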