Chapter 7.2 : The classical approach
7.2.1) The main.cpp
7.2.2) The CMakeLists.txt
7.2.3) The compilation
7.2.4) The performances
By now, you already know that we want to evaluate the computing performance of the saxpy kernel, so the explanations will be shorter.
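As a reminder of what is being measured, here is a minimal sketch of a scalar (non-vectorized) saxpy kernel. It assumes the usual definition, result[i] = a * x[i] + y[i], in single precision. The function name `saxpy`, its signature and the use of `std::vector` are illustrative assumptions; they are not taken from the main.cpp of this chapter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the tutorial's own code):
// computes result[i] = a * x[i] + y[i] for every element.
std::vector<float> saxpy(float a, const std::vector<float>& x, const std::vector<float>& y) {
    // Both input tables must have the same number of elements.
    assert(x.size() == y.size());
    std::vector<float> result(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        result[i] = a * x[i] + y[i];
    }
    return result;
}
```

A call such as `saxpy(2.0f, x, y)` returns a new table and leaves `x` and `y` untouched. This plain loop is the kind of baseline that a classical approach measures before any vectorization is attempted.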