3.7.1.4: Compilation with -O3

Let's call MAQAO CQA to analyse the hadamard_product function:


Here is the full output:
maqao.intel64 cqa --fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_O3
Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).

Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m
Section 1: Function: hadamard_product(float*, float const*, float const*, unsigned long)
========================================================================================

Code for this function has been compiled to run on any x86-64 processor (SSE2, 2004). It is not optimized for later processors (AVX etc.).
These loops are supposed to be defined in: Examples/1-HadamardProduct/main.cpp

Section 1.1: Source loop ending at line 20
==========================================

Composition and unrolling
-------------------------
It is composed of the following loops [ID (first-last source line)]:
 - 0 (19-20)
 - 1 (20-20)
and is unrolled by 4 (including vectorization).

The following loops are considered as:
 - unrolled and/or vectorized main: 1
 - peel or tail: 0
The analysis will be displayed for the unrolled and/or vectorized loops: 1

Section 1.1.1: Binary (unrolled and/or vectorized) loop #1
==========================================================

The loop is defined in Examples/1-HadamardProduct/main.cpp:20-20.

It is main loop of related source loop which is unrolled by 4 (including vectorization).
12% of peak computational performance is used (4.00 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Vectorization
-------------
Your loop is vectorized, but using only 128 out of 256 bits (SSE/AVX-128 instructions on AVX/AVX2 processors).
By fully vectorizing your loop, you can lower the cost of an iteration from 1.00 to 0.50 cycles (2.00x speedup).
All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).
Since your execution units are vector units, only a fully vectorized loop can use their full power.

Workaround(s):
 - Recompile with march=skylake.
CQA target is Core_7x_V2 (Intel Kaby Lake Core Processors) but specialization flags are -march=x86-64
 - Use vector aligned instructions:
  1) align your arrays on 32 bytes boundaries: replace { void *p = malloc (size); } with { void *p; posix_memalign (&p, 32, size); }.
  2) inform your compiler that your arrays are vector aligned: if array 'foo' is 32 bytes-aligned, define a pointer 'p_foo' as __builtin_assume_aligned (foo, 32) and use it instead of 'foo' in the loop.


Execution units bottlenecks
---------------------------
Performance is limited by:
 - execution of FP multiply or FMA (fused multiply-add) operations (the FP multiply/FMA unit is a bottleneck)
 - reading data from caches/RAM (load units are a bottleneck)
 - writing data to caches/RAM (the store unit is a bottleneck)

Workaround(s):
 - Reduce the number of FP multiply/FMA instructions
 - Read less array elements
 - Write less array elements
 - Provide more information to your compiler:
  * hardcode the bounds of the corresponding 'for' loop



All innermost loops were analyzed.

Info: Rerun CQA with conf=hint,expert to display more advanced reports or conf=all to display them with default reports.
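
Before moving on, here is a minimal sketch of how the two alignment suggestions above could be applied to this kernel. The loop body is reconstructed from the report; the helper alloc_aligned and the function name hadamard_product_aligned are hypothetical. Combined with recompiling with -march=skylake (the first workaround), this would let the compiler emit aligned 256-bit AVX instructions:

#include <cstdlib>

// Hypothetical helper: allocate a 32-byte-aligned buffer of n floats,
// using posix_memalign instead of malloc as the report suggests.
static float *alloc_aligned(unsigned long n)
{
    void *p = nullptr;
    if (posix_memalign(&p, 32, n * sizeof(float)) != 0)
        return nullptr;
    return static_cast<float *>(p);
}

// Sketch of the kernel with __builtin_assume_aligned applied:
// the compiler is told all three pointers are 32-byte-aligned,
// so it can use aligned vector loads and stores.
void hadamard_product_aligned(float *out, const float *a, const float *b,
                              unsigned long n)
{
    float       *p_out = static_cast<float *>(__builtin_assume_aligned(out, 32));
    const float *p_a   = static_cast<const float *>(__builtin_assume_aligned(a, 32));
    const float *p_b   = static_cast<const float *>(__builtin_assume_aligned(b, 32));
    for (unsigned long i = 0; i < n; ++i)
        p_out[i] = p_a[i] * p_b[i];
}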

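The last workaround ("hardcode the bounds of the corresponding 'for' loop") can be sketched too. This variant is purely hypothetical, since the original signature takes the size at run time; it assumes the element count N is known at compile time:

// Hypothetical variant with a compile-time trip count: the compiler
// then knows the exact number of iterations and can fully unroll and
// vectorize without keeping a runtime peel/tail loop.
constexpr unsigned long N = 1024;

void hadamard_product_fixed(float *out, const float *a, const float *b)
{
    for (unsigned long i = 0; i < N; ++i)
        out[i] = a[i] * b[i];
}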

Let's rerun it with the conf=all option (the resulting report is mainly intended for experts, but at least you know how to obtain this information):


Here is the full output:
maqao.intel64 cqa conf=all --fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_O3
Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).

Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m
Section 1: Function: hadamard_product(float*, float const*, float const*, unsigned long)
========================================================================================

Code for this function has been compiled to run on any x86-64 processor (SSE2, 2004). It is not optimized for later processors (AVX etc.).
These loops are supposed to be defined in: Examples/1-HadamardProduct/main.cpp

Section 1.1: Source loop ending at line 20
==========================================

Composition and unrolling
-------------------------
It is composed of the following loops [ID (first-last source line)]:
 - 0 (19-20)
 - 1 (20-20)
and is unrolled by 4 (including vectorization).

The following loops are considered as:
 - unrolled and/or vectorized main: 1
 - peel or tail: 0
The analysis will be displayed for the unrolled and/or vectorized loops: 1

Section 1.1.1: Binary (unrolled and/or vectorized) loop #1
==========================================================

The loop is defined in Examples/1-HadamardProduct/main.cpp:20-20.

It is main loop of related source loop which is unrolled by 4 (including vectorization).
12% of peak computational performance is used (4.00 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Vectorization
-------------
Your loop is vectorized, but using only 128 out of 256 bits (SSE/AVX-128 instructions on AVX/AVX2 processors).
By fully vectorizing your loop, you can lower the cost of an iteration from 1.00 to 0.50 cycles (2.00x speedup).
All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).
Since your execution units are vector units, only a fully vectorized loop can use their full power.

Workaround(s):
 - Recompile with march=skylake.
CQA target is Core_7x_V2 (Intel Kaby Lake Core Processors) but specialization flags are -march=x86-64
 - Use vector aligned instructions:
  1) align your arrays on 32 bytes boundaries: replace { void *p = malloc (size); } with { void *p; posix_memalign (&p, 32, size); }.
  2) inform your compiler that your arrays are vector aligned: if array 'foo' is 32 bytes-aligned, define a pointer 'p_foo' as __builtin_assume_aligned (foo, 32) and use it instead of 'foo' in the loop.


Execution units bottlenecks
---------------------------
Performance is limited by:
 - execution of FP multiply or FMA (fused multiply-add) operations (the FP multiply/FMA unit is a bottleneck)
 - reading data from caches/RAM (load units are a bottleneck)
 - writing data to caches/RAM (the store unit is a bottleneck)

Workaround(s):
 - Reduce the number of FP multiply/FMA instructions
 - Read less array elements
 - Write less array elements
 - Provide more information to your compiler:
  * hardcode the bounds of the corresponding 'for' loop


Vector unaligned load/store instructions
----------------------------------------
Detected 2 suboptimal vector unaligned load/store instructions.

 - MOVUPS: 2 occurrences

Workaround(s):
 - Recompile with march=skylake.
CQA target is Core_7x_V2 (Intel Kaby Lake Core Processors) but specialization flags are -march=x86-64
 - Use vector aligned instructions:
  1) align your arrays on 32 bytes boundaries: replace { void *p = malloc (size); } with { void *p; posix_memalign (&p, 32, size); }.
  2) inform your compiler that your arrays are vector aligned: if array 'foo' is 32 bytes-aligned, define a pointer 'p_foo' as __builtin_assume_aligned (foo, 32) and use it instead of 'foo' in the loop.


Type of elements and instruction set
------------------------------------
1 SSE or AVX instructions are processing arithmetic or math operations on single precision FP elements in vector mode (four at a time).


Matching between your loop (in the source code) and the binary loop
-------------------------------------------------------------------
The binary loop is composed of 4 FP arithmetical operations:
 - 4: multiply
The binary loop is loading 32 bytes (8 single precision FP elements).
The binary loop is storing 16 bytes (4 single precision FP elements).

Arithmetic intensity
--------------------
Arithmetic intensity is 0.08 FP operations per loaded or stored byte.

ASM code
--------
In the binary file, the address of the loop is: 1110

Instruction                                  | Nb FU | P0   | P1   | P2   | P3   | P4 | P5   | P6   | P7   | Latency | Recip. throughput
----------------------------------------------------------------------------------------------------------------------------------------
MOVUPS (%R11,%RAX,1),%XMM0                   | 1     | 0    | 0    | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 3       | 0.50
ADD $0x1,%R9                                 | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
MULPS (%RBX,%RAX,1),%XMM0                    | 1     | 0.50 | 0.50 | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 4       | 1
MOVUPS %XMM0,(%R8,%RAX,1)                    | 1     | 0    | 0    | 0.33 | 0.33 | 1  | 0    | 0    | 0.33 | 1       | 1
ADD $0x10,%RAX                               | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
CMP %RBP,%R9                                 | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
JB 1110 <_Z16hadamard_productPfPKfS1_m+0xd0> | 1     | 0.50 | 0    | 0    | 0    | 0  | 0    | 0.50 | 0    | 0       | 0.50


General properties
------------------
nb instructions    : 7
nb uops            : 6
loop length        : 27
used x86 registers : 6
used mmx registers : 0
used xmm registers : 1
used ymm registers : 0
used zmm registers : 0
nb stack references: 0


Front-end
---------
ASSUMED MACRO FUSION
FIT IN UOP CACHE
micro-operation queue: 1.00 cycles
front end            : 1.00 cycles


Back-end
--------
       | P0   | P1   | P2   | P3   | P4   | P5   | P6   | P7
--------------------------------------------------------------
uops   | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
cycles | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00

Cycles executing div or sqrt instructions: NA
Longest recurrence chain latency (RecMII): 1.00


Cycles summary
--------------
Front-end : 1.00
Dispatch  : 1.00
Data deps.: 1.00
Overall L1: 1.00


Vectorization ratios
--------------------
all    : 100%
load   : 100%
store  : 100%
mul    : 100%
add-sub: NA (no add-sub vectorizable/vectorized instructions)
other  : NA (no other vectorizable/vectorized instructions)


Vector efficiency ratios
------------------------
all    : 50%
load   : 50%
store  : 50%
mul    : 50%
add-sub: NA (no add-sub vectorizable/vectorized instructions)
other  : NA (no other vectorizable/vectorized instructions)


Cycles and memory resources usage
---------------------------------
Assuming all data fit into the L1 cache, each iteration of the binary loop takes 1.00 cycles. At this rate:
 - 50% of peak load performance is reached (32.00 out of 64.00 bytes loaded per cycle (GB/s @ 1GHz))
 - 50% of peak store performance is reached (16.00 out of 32.00 bytes stored per cycle (GB/s @ 1GHz))


Front-end bottlenecks
---------------------
Performance is limited by instruction throughput (loading/decoding program instructions to execution core) (front-end is a bottleneck).



All innermost loops were analyzed.
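
As a sanity check, the key figures of this expert report can be recomputed by hand from the ASM code section (assuming, as CQA does, a 1 GHz clock, and that Kaby Lake has two 256-bit FMA-capable execution ports):
 - one 128-bit MULPS performs 4 single precision multiplies and the loop takes 1.00 cycle per iteration, hence 4.00 FLOP per cycle; the peak is 2 FMA units x 8 floats x 2 FLOP = 32.00 FLOP per cycle, so 4/32 = 12.5%, the "12% of peak" reported above;
 - each iteration loads 32 bytes and stores 16 bytes, so the arithmetic intensity is 4 / (32 + 16) = 0.083 FLOP per byte, matching the 0.08 reported;
 - the vector efficiency ratios are 50% because every XMM operation uses only 128 of the 256 bits the hardware offers, which is exactly the "2.00x speedup" full vectorization promises.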