Chapter 10.4: Vectorization
10.4.1) The sgemm_vectorize.h file
10.4.2) The sgemm_vectorize.cpp file
10.4.3) The main_sgemm_vectorize.cpp file
10.4.4) The CMakeLists.txt file
10.4.5) The compilation
10.4.6) The performances