3.7.1.4: Compilation with -O3

Let's call MAQAO CQA to analyse the hadamard_product function:


Here is the full output:
maqao.intel64 cqa --fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_O3
Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).

Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m
Section 1: Function: hadamard_product(float*, float const*, float const*, unsigned long)
========================================================================================

Code for this function has been compiled to run on any x86-64 processor (SSE2, 2004). It is not optimized for later processors (AVX etc.).
These loops are supposed to be defined in: Examples/1-HadamardProduct/main.cpp

Section 1.1: Source loop ending at line 20
==========================================

Composition and unrolling
-------------------------
It is composed of the following loops [ID (first-last source line)]:
 - 0 (19-20)
 - 1 (20-20)
and is unrolled by 4 (including vectorization).

The following loops are considered as:
 - unrolled and/or vectorized main: 1
 - peel or tail: 0
The analysis will be displayed for the unrolled and/or vectorized loops: 1

Section 1.1.1: Binary (unrolled and/or vectorized) loop #1
==========================================================

The loop is defined in Examples/1-HadamardProduct/main.cpp:20-20.

It is main loop of related source loop which is unrolled by 4 (including vectorization).
12% of peak computational performance is used (4.00 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Vectorization
-------------
Your loop is vectorized, but using only 128 out of 256 bits (SSE/AVX-128 instructions on AVX/AVX2 processors).
By fully vectorizing your loop, you can lower the cost of an iteration from 1.00 to 0.50 cycles (2.00x speedup).
All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).
Since your execution units are vector units, only a fully vectorized loop can use their full power.

Workaround(s):
 - Recompile with march=skylake.
CQA target is Core_7x_V2 (Intel Kaby Lake Core Processors) but specialization flags are -march=x86-64
 - Use vector aligned instructions:
  1) align your arrays on 32 bytes boundaries: replace { void *p = malloc (size); } with { void *p; posix_memalign (&p, 32, size); }.
  2) inform your compiler that your arrays are vector aligned: if array 'foo' is 32 bytes-aligned, define a pointer 'p_foo' as __builtin_assume_aligned (foo, 32) and use it instead of 'foo' in the loop.


Execution units bottlenecks
---------------------------
Performance is limited by:
 - execution of FP multiply or FMA (fused multiply-add) operations (the FP multiply/FMA unit is a bottleneck)
 - reading data from caches/RAM (load units are a bottleneck)
 - writing data to caches/RAM (the store unit is a bottleneck)

Workaround(s):
 - Reduce the number of FP multiply/FMA instructions
 - Read less array elements
 - Write less array elements
 - Provide more information to your compiler:
  * hardcode the bounds of the corresponding 'for' loop



All innermost loops were analyzed.

Info: Rerun CQA with conf=hint,expert to display more advanced reports or conf=all to display them with default reports.
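
Before moving on, here is a minimal sketch of how the two alignment suggestions above could be applied to this kernel. The loop body is reconstructed from the report; the helper alloc_aligned and the function name hadamard_product_aligned are hypothetical. Combined with recompiling with -march=skylake (the first workaround), this would let the compiler emit aligned 256-bit AVX instructions:

#include <cstdlib>

// Hypothetical helper: allocate a 32-byte-aligned buffer of n floats,
// using posix_memalign instead of malloc as the report suggests.
static float *alloc_aligned(unsigned long n)
{
    void *p = nullptr;
    if (posix_memalign(&p, 32, n * sizeof(float)) != 0)
        return nullptr;
    return static_cast<float *>(p);
}

// Sketch of the kernel with __builtin_assume_aligned applied:
// the compiler is told all three pointers are 32-byte-aligned,
// so it can use aligned vector loads and stores.
void hadamard_product_aligned(float *out, const float *a, const float *b,
                              unsigned long n)
{
    float       *p_out = static_cast<float *>(__builtin_assume_aligned(out, 32));
    const float *p_a   = static_cast<const float *>(__builtin_assume_aligned(a, 32));
    const float *p_b   = static_cast<const float *>(__builtin_assume_aligned(b, 32));
    for (unsigned long i = 0; i < n; ++i)
        p_out[i] = p_a[i] * p_b[i];
}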

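The last workaround ("hardcode the bounds of the corresponding 'for' loop") can be sketched too. This variant is purely hypothetical, since the original signature takes the size at run time; it assumes the element count N is known at compile time:

// Hypothetical variant with a compile-time trip count: the compiler
// then knows the exact number of iterations and can fully unroll and
// vectorize without keeping a runtime peel/tail loop.
constexpr unsigned long N = 1024;

void hadamard_product_fixed(float *out, const float *a, const float *b)
{
    for (unsigned long i = 0; i < N; ++i)
        out[i] = a[i] * b[i];
}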

Let's rerun it with the conf=all option (the resulting report is mainly intended for experts, but at least you know how to obtain this information):


Here is the full output:
maqao.intel64 cqa conf=all --fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_O3
Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).

Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m
Section 1: Function: hadamard_product(float*, float const*, float const*, unsigned long)
========================================================================================

Code for this function has been compiled to run on any x86-64 processor (SSE2, 2004). It is not optimized for later processors (AVX etc.).
These loops are supposed to be defined in: Examples/1-HadamardProduct/main.cpp

Section 1.1: Source loop ending at line 20
==========================================

Composition and unrolling
-------------------------
It is composed of the following loops [ID (first-last source line)]:
 - 0 (19-20)
 - 1 (20-20)
and is unrolled by 4 (including vectorization).

The following loops are considered as:
 - unrolled and/or vectorized main: 1
 - peel or tail: 0
The analysis will be displayed for the unrolled and/or vectorized loops: 1

Section 1.1.1: Binary (unrolled and/or vectorized) loop #1
==========================================================

The loop is defined in Examples/1-HadamardProduct/main.cpp:20-20.

It is main loop of related source loop which is unrolled by 4 (including vectorization).
12% of peak computational performance is used (4.00 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Vectorization
-------------
Your loop is vectorized, but using only 128 out of 256 bits (SSE/AVX-128 instructions on AVX/AVX2 processors).
By fully vectorizing your loop, you can lower the cost of an iteration from 1.00 to 0.50 cycles (2.00x speedup).
All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).
Since your execution units are vector units, only a fully vectorized loop can use their full power.

Workaround(s):
 - Recompile with march=skylake.
CQA target is Core_7x_V2 (Intel Kaby Lake Core Processors) but specialization flags are -march=x86-64
 - Use vector aligned instructions:
  1) align your arrays on 32 bytes boundaries: replace { void *p = malloc (size); } with { void *p; posix_memalign (&p, 32, size); }.
  2) inform your compiler that your arrays are vector aligned: if array 'foo' is 32 bytes-aligned, define a pointer 'p_foo' as __builtin_assume_aligned (foo, 32) and use it instead of 'foo' in the loop.


Execution units bottlenecks
---------------------------
Performance is limited by:
 - execution of FP multiply or FMA (fused multiply-add) operations (the FP multiply/FMA unit is a bottleneck)
 - reading data from caches/RAM (load units are a bottleneck)
 - writing data to caches/RAM (the store unit is a bottleneck)

Workaround(s):
 - Reduce the number of FP multiply/FMA instructions
 - Read less array elements
 - Write less array elements
 - Provide more information to your compiler:
  * hardcode the bounds of the corresponding 'for' loop


Vector unaligned load/store instructions
----------------------------------------
Detected 2 suboptimal vector unaligned load/store instructions.

 - MOVUPS: 2 occurrences

Workaround(s):
 - Recompile with march=skylake.
CQA target is Core_7x_V2 (Intel Kaby Lake Core Processors) but specialization flags are -march=x86-64
 - Use vector aligned instructions:
  1) align your arrays on 32 bytes boundaries: replace { void *p = malloc (size); } with { void *p; posix_memalign (&p, 32, size); }.
  2) inform your compiler that your arrays are vector aligned: if array 'foo' is 32 bytes-aligned, define a pointer 'p_foo' as __builtin_assume_aligned (foo, 32) and use it instead of 'foo' in the loop.


Type of elements and instruction set
------------------------------------
1 SSE or AVX instructions are processing arithmetic or math operations on single precision FP elements in vector mode (four at a time).


Matching between your loop (in the source code) and the binary loop
-------------------------------------------------------------------
The binary loop is composed of 4 FP arithmetical operations:
 - 4: multiply
The binary loop is loading 32 bytes (8 single precision FP elements).
The binary loop is storing 16 bytes (4 single precision FP elements).

Arithmetic intensity
--------------------
Arithmetic intensity is 0.08 FP operations per loaded or stored byte.

ASM code
--------
In the binary file, the address of the loop is: 1110

Instruction                                  | Nb FU | P0   | P1   | P2   | P3   | P4 | P5   | P6   | P7   | Latency | Recip. throughput
----------------------------------------------------------------------------------------------------------------------------------------
MOVUPS (%R11,%RAX,1),%XMM0                   | 1     | 0    | 0    | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 3       | 0.50
ADD $0x1,%R9                                 | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
MULPS (%RBX,%RAX,1),%XMM0                    | 1     | 0.50 | 0.50 | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 4       | 1
MOVUPS %XMM0,(%R8,%RAX,1)                    | 1     | 0    | 0    | 0.33 | 0.33 | 1  | 0    | 0    | 0.33 | 1       | 1
ADD $0x10,%RAX                               | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
CMP %RBP,%R9                                 | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
JB 1110 <_Z16hadamard_productPfPKfS1_m+0xd0> | 1     | 0.50 | 0    | 0    | 0    | 0  | 0    | 0.50 | 0    | 0       | 0.50


General properties
------------------
nb instructions    : 7
nb uops            : 6
loop length        : 27
used x86 registers : 6
used mmx registers : 0
used xmm registers : 1
used ymm registers : 0
used zmm registers : 0
nb stack references: 0


Front-end
---------
ASSUMED MACRO FUSION
FIT IN UOP CACHE
micro-operation queue: 1.00 cycles
front end            : 1.00 cycles


Back-end
--------
       | P0   | P1   | P2   | P3   | P4   | P5   | P6   | P7
--------------------------------------------------------------
uops   | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
cycles | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00

Cycles executing div or sqrt instructions: NA
Longest recurrence chain latency (RecMII): 1.00


Cycles summary
--------------
Front-end : 1.00
Dispatch  : 1.00
Data deps.: 1.00
Overall L1: 1.00


Vectorization ratios
--------------------
all    : 100%
load   : 100%
store  : 100%
mul    : 100%
add-sub: NA (no add-sub vectorizable/vectorized instructions)
other  : NA (no other vectorizable/vectorized instructions)


Vector efficiency ratios
------------------------
all    : 50%
load   : 50%
store  : 50%
mul    : 50%
add-sub: NA (no add-sub vectorizable/vectorized instructions)
other  : NA (no other vectorizable/vectorized instructions)


Cycles and memory resources usage
---------------------------------
Assuming all data fit into the L1 cache, each iteration of the binary loop takes 1.00 cycles. At this rate:
 - 50% of peak load performance is reached (32.00 out of 64.00 bytes loaded per cycle (GB/s @ 1GHz))
 - 50% of peak store performance is reached (16.00 out of 32.00 bytes stored per cycle (GB/s @ 1GHz))


Front-end bottlenecks
---------------------
Performance is limited by instruction throughput (loading/decoding program instructions to execution core) (front-end is a bottleneck).



All innermost loops were analyzed.
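
As a sanity check, the key figures of this expert report can be recomputed by hand from the ASM code section (assuming, as CQA does, a 1 GHz clock, and that Kaby Lake has two 256-bit FMA-capable execution ports):
 - one 128-bit MULPS performs 4 single precision multiplies and the loop takes 1.00 cycle per iteration, hence 4.00 FLOP per cycle; the peak is 2 FMA units x 8 floats x 2 FLOP = 32.00 FLOP per cycle, so 4/32 = 12.5%, the "12% of peak" reported above;
 - each iteration loads 32 bytes and stores 16 bytes, so the arithmetic intensity is 4 / (32 + 16) = 0.083 FLOP per byte, matching the 0.08 reported;
 - the vector efficiency ratios are 50% because every XMM operation uses only 128 of the 256 bits the hardware offers, which is exactly the "2.00x speedup" full vectorization promises.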