3.7.3.1 : Classical intrinsic version

Let's run MAQAO's CQA module to analyse the hadamard_product function. Here is the full output:

maqao.intel64 cqa --fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_intrinsics
Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).

Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m
Section 1: Function: hadamard_product(float*, float const*, float const*, unsigned long)
========================================================================================

Code for this function has been specialized for Broadwell. For execution on another machine, recompile on it or with explicit target
(example for a Haswell machine: use -march=haswell, see compiler manual for full list).
These loops are supposed to be defined in: Examples/1-HadamardProduct/main_intrinsics.cpp

Section 1.1: Source loop ending at line 24
==========================================

Composition and unrolling
-------------------------
It is composed of the loop 0
and is not unrolled or unrolled with no peel/tail loop.

Section 1.1.1: Binary loop #0
=============================

The loop is defined in:
 - /usr/lib/gcc/x86_64-linux-gnu/7/include/avxintrin.h: 319-879
 - Examples/1-HadamardProduct/main_intrinsics.cpp: 24-24


The related source loop is not unrolled or unrolled with no peel/tail loop.
25% of peak computational performance is used (8.00 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Vectorization
-------------
Your loop is fully vectorized, using full register length.

All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).


Execution units bottlenecks
---------------------------
Performance is limited by:
 - execution of FP multiply or FMA (fused multiply-add) operations (the FP multiply/FMA unit is a bottleneck)
 - reading data from caches/RAM (load units are a bottleneck)
 - writing data to caches/RAM (the store unit is a bottleneck)

Workaround(s):
 - Reduce the number of FP multiply/FMA instructions
 - Read less array elements
 - Write less array elements
 - Provide more information to your compiler:
  * hardcode the bounds of the corresponding 'for' loop



All innermost loops were analyzed.

Info: Rerun CQA with conf=hint,expert to display more advanced reports or conf=all to display them with default reports.
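
The report attributes the loop to avxintrin.h and to line 24 of main_intrinsics.cpp, and says it is fully vectorized over the full register length. The source file itself is not reproduced here, so the following is only a minimal sketch of what such a kernel could look like. It assumes the element count is a multiple of 8 and that all three pointers are 32-byte aligned. The scalar fallback under `#else` is an addition so the sketch still compiles when AVX is not enabled; it is not part of the analysed program.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__AVX__)
#include <immintrin.h>  // AVX intrinsics, only pulled in when the compiler targets AVX
#endif

// Hypothetical reconstruction of the analysed kernel: out[i] = a[i] * b[i].
// Assumptions (not confirmed by the source): n is a multiple of 8 and all
// three pointers are 32-byte aligned, as aligned AVX loads/stores require.
void hadamard_product(float* out, const float* a, const float* b, unsigned long n) {
    assert(n % 8 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(a) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 32 == 0);
#if defined(__AVX__)
    // One iteration handles 8 floats: two aligned loads, one multiply, one aligned store.
    for (unsigned long i = 0; i < n; i += 8) {
        const __m256 va = _mm256_load_ps(a + i);
        const __m256 vb = _mm256_load_ps(b + i);
        _mm256_store_ps(out + i, _mm256_mul_ps(va, vb));
    }
#else
    // Portable fallback (an addition for this sketch, not the analysed code).
    for (unsigned long i = 0; i < n; ++i) {
        out[i] = a[i] * b[i];
    }
#endif
}
```

Built with AVX enabled (for example with -mavx or a suitable -march), each iteration performs 8 single precision multiplications, which matches the 8 FLOP per cycle that the report compares against the peak of 32.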

Let's rerun it with the conf=all option, which adds the advanced reports to the default ones. This level of detail is mainly for experts, but now you know how to get it. Here is the full output:

maqao.intel64 cqa conf=all --fct-loops=hadamard_product ./1-HadamardProduct/hadamard_product_intrinsics
Target processor is: Intel Kaby Lake Core Processors (x86_64/Kaby Lake micro-architecture).

Info: No innermost loops in the function _GLOBAL__sub_I__Z16hadamard_productPfPKfS1_m
Section 1: Function: hadamard_product(float*, float const*, float const*, unsigned long)
========================================================================================

Code for this function has been specialized for Broadwell. For execution on another machine, recompile on it or with explicit target
(example for a Haswell machine: use -march=haswell, see compiler manual for full list).
These loops are supposed to be defined in: Examples/1-HadamardProduct/main_intrinsics.cpp

Section 1.1: Source loop ending at line 24
==========================================

Composition and unrolling
-------------------------
It is composed of the loop 0
and is not unrolled or unrolled with no peel/tail loop.

Section 1.1.1: Binary loop #0
=============================

The loop is defined in:
 - /usr/lib/gcc/x86_64-linux-gnu/7/include/avxintrin.h: 319-879
 - Examples/1-HadamardProduct/main_intrinsics.cpp: 24-24


The related source loop is not unrolled or unrolled with no peel/tail loop.
25% of peak computational performance is used (8.00 out of 32.00 FLOP per cycle (GFLOPS @ 1GHz))

Vectorization
-------------
Your loop is fully vectorized, using full register length.

All SSE/AVX instructions are used in vector version (process two or more data elements in vector registers).


Execution units bottlenecks
---------------------------
Performance is limited by:
 - execution of FP multiply or FMA (fused multiply-add) operations (the FP multiply/FMA unit is a bottleneck)
 - reading data from caches/RAM (load units are a bottleneck)
 - writing data to caches/RAM (the store unit is a bottleneck)

Workaround(s):
 - Reduce the number of FP multiply/FMA instructions
 - Read less array elements
 - Write less array elements
 - Provide more information to your compiler:
  * hardcode the bounds of the corresponding 'for' loop


Type of elements and instruction set
------------------------------------
1 AVX instructions are processing arithmetic or math operations on single precision FP elements in vector mode (eight at a time).


Matching between your loop (in the source code) and the binary loop
-------------------------------------------------------------------
The binary loop is composed of 8 FP arithmetical operations:
 - 8: multiply
The binary loop is loading 64 bytes (16 single precision FP elements).
The binary loop is storing 32 bytes (8 single precision FP elements).

Arithmetic intensity
--------------------
Arithmetic intensity is 0.08 FP operations per loaded or stored byte.

Unroll opportunity
------------------
Loop body is too small to efficiently use resources.
Workaround(s):
Unroll your loop if trip count is significantly higher than target unroll factor. This can be done manually.
Or by recompiling with -funroll-loops and/or -floop-unroll-and-jam.

ASM code
--------
In the binary file, the address of the loop is: 1020

Instruction                                   | Nb FU | P0   | P1   | P2   | P3   | P4 | P5   | P6   | P7   | Latency | Recip. throughput
-----------------------------------------------------------------------------------------------------------------------------------------
VMOVAPS (%RSI,%RAX,1),%YMM0                   | 1     | 0    | 0    | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 3       | 0.50
VMULPS (%RDX,%RAX,1),%YMM0,%YMM0              | 1     | 0.50 | 0.50 | 0.50 | 0.50 | 0  | 0    | 0    | 0    | 4       | 0.50
VMOVAPS %YMM0,(%RDI,%RAX,1)                   | 1     | 0    | 0    | 0.33 | 0.33 | 1  | 0    | 0    | 0.33 | 3       | 1
ADD $0x20,%RAX                                | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
CMP %RAX,%RCX                                 | 1     | 0.25 | 0.25 | 0    | 0    | 0  | 0.25 | 0.25 | 0    | 1       | 0.25
JNE 1020 <_Z16hadamard_productPfPKfS1_m+0x10> | 1     | 0.50 | 0    | 0    | 0    | 0  | 0    | 0.50 | 0    | 0       | 0.50-1


General properties
------------------
nb instructions    : 6
nb uops            : 5
loop length        : 24
used x86 registers : 5
used mmx registers : 0
used xmm registers : 0
used ymm registers : 1
used zmm registers : 0
nb stack references: 0


Front-end
---------
ASSUMED MACRO FUSION
FIT IN UOP CACHE
micro-operation queue: 1.00 cycles
front end            : 1.00 cycles


Back-end
--------
       | P0   | P1   | P2   | P3   | P4   | P5   | P6   | P7
--------------------------------------------------------------
uops   | 1.00 | 0.75 | 1.00 | 1.00 | 1.00 | 0.75 | 0.50 | 1.00
cycles | 1.00 | 0.75 | 1.00 | 1.00 | 1.00 | 0.75 | 0.50 | 1.00

Cycles executing div or sqrt instructions: NA
Longest recurrence chain latency (RecMII): 1.00


Cycles summary
--------------
Front-end : 1.00
Dispatch  : 1.00
Data deps.: 1.00
Overall L1: 1.00


Vectorization ratios
--------------------
all    : 100%
load   : 100%
store  : 100%
mul    : 100%
add-sub: NA (no add-sub vectorizable/vectorized instructions)
other  : NA (no other vectorizable/vectorized instructions)


Vector efficiency ratios
------------------------
all    : 100%
load   : 100%
store  : 100%
mul    : 100%
add-sub: NA (no add-sub vectorizable/vectorized instructions)
other  : NA (no other vectorizable/vectorized instructions)


Cycles and memory resources usage
---------------------------------
Assuming all data fit into the L1 cache, each iteration of the binary loop takes 1.00 cycles. At this rate:
 - 100% of peak load performance is reached (64.00 out of 64.00 bytes loaded per cycle (GB/s @ 1GHz))
 - 100% of peak store performance is reached (32.00 out of 32.00 bytes stored per cycle (GB/s @ 1GHz))


Front-end bottlenecks
---------------------
Performance is limited by instruction throughput (loading/decoding program instructions to execution core) (front-end is a bottleneck).

By removing all these bottlenecks, you can lower the cost of an iteration from 1.00 to 0.75 cycles (1.33x speedup).



All innermost loops were analyzed.
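
The headline figures of this report can be rederived from the raw counts it prints: 8 FP operations, 64 bytes loaded and 32 bytes stored per iteration, 1.00 cycle per iteration, and a peak of 32 FLOP per cycle. The sketch below only redoes that arithmetic; the struct and function names are hypothetical, and the input values are copied from the report rather than measured.

```cpp
#include <cassert>

// Per-iteration counts of a binary loop, as printed by the report.
struct LoopCounts {
    double flop;          // FP arithmetical operations
    double bytes_loaded;  // bytes read
    double bytes_stored;  // bytes written
    double cycles;        // estimated cycles, assuming data fit in L1
};

// FP operations per loaded or stored byte.
double arithmetic_intensity(const LoopCounts& c) {
    const double bytes = c.bytes_loaded + c.bytes_stored;
    assert(bytes > 0.0);
    return c.flop / bytes;
}

// FP operations executed per cycle.
double flop_per_cycle(const LoopCounts& c) {
    assert(c.cycles > 0.0);
    return c.flop / c.cycles;
}

// Fraction of the peak computational performance that is used.
double peak_fraction(const LoopCounts& c, double peak_flop_per_cycle) {
    assert(peak_flop_per_cycle > 0.0);
    return flop_per_cycle(c) / peak_flop_per_cycle;
}

// Speedup obtained when an iteration goes from current_cycles to target_cycles.
double speedup(double current_cycles, double target_cycles) {
    assert(target_cycles > 0.0);
    return current_cycles / target_cycles;
}

// Values copied from the report above.
const LoopCounts report_loop = {8.0, 64.0, 32.0, 1.00};
```

With these inputs, `arithmetic_intensity(report_loop)` is 8 / 96, about 0.083, which the report rounds to 0.08. `peak_fraction(report_loop, 32.0)` is 0.25, the 25% of peak quoted at the top of the loop report, and `speedup(1.00, 0.75)` is about 1.33, the gain announced if all bottlenecks were removed.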