3.4.7.1.4 : CUDA Hadamard product on MUST


Do not forget to replace G++9 with /opt/rh/devtoolset-7/root/usr/bin/x86_64-redhat-linux-g++

condor_submit submit.condor 
Submitting job(s).
1 job(s) submitted to cluster 9433.


Once the job has finished:

cat hadamard_product_cuda.output
Used machine is Linux lapp-wngpu005.in2p3.fr 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Remove existing directory build
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 7.3.1
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc - works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /opt/rh/devtoolset-7/root/usr/bin/x86_64-redhat-linux-g++
-- Check for working CXX compiler: /opt/rh/devtoolset-7/root/usr/bin/x86_64-redhat-linux-g++ - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found version "11.4") 
-- Found headers CUDA : /usr/local/cuda/include
-- Found lib CUDA : /usr/local/cuda/lib64/libcudart_static.a;Threads::Threads;dl;/usr/lib64/librt.so
-- Configuring done
-- Generating done
-- Build files have been written to: XXX/HadamardProductCuda/build
Scanning dependencies of target asterics_hpc_cuda
[  9%] Building CXX object AstericsHPC/CMakeFiles/asterics_hpc_cuda.dir/asterics_alloc.cpp.o
[ 18%] Building CXX object AstericsHPC/CMakeFiles/asterics_hpc_cuda.dir/timer.cpp.o
[ 27%] Building CXX object AstericsHPC/CMakeFiles/asterics_hpc_cuda.dir/asterics_cuda.cpp.o
[ 36%] Linking CXX shared library libasterics_hpc_cuda.so
[ 36%] Built target asterics_hpc_cuda
[ 45%] Building NVCC (Device) object src/CMakeFiles/hadamard_product_cuda.dir/hadamard_product_cuda_generated_hadamard_product_cuda.cu.o
Scanning dependencies of target hadamard_product_cuda
[ 54%] Building C object src/CMakeFiles/hadamard_product_cuda.dir/phoenix_cuda_check.c.o
[ 63%] Linking CXX shared library libhadamard_product_cuda.so
[ 63%] Built target hadamard_product_cuda
Scanning dependencies of target hadamard_product_gpu_cuda_kernel
[ 72%] Building CXX object program/CMakeFiles/hadamard_product_gpu_cuda_kernel.dir/main_kernel.cpp.o
[ 81%] Linking CXX executable hadamard_product_gpu_cuda_kernel
[ 81%] Built target hadamard_product_gpu_cuda_kernel
Scanning dependencies of target hadamard_product_gpu_cuda
[ 90%] Building CXX object program/CMakeFiles/hadamard_product_gpu_cuda.dir/main.cpp.o
[100%] Linking CXX executable hadamard_product_gpu_cuda
[100%] Built target hadamard_product_gpu_cuda
-- Found headers CUDA : /usr/local/cuda/include
-- Found lib CUDA : /usr/local/cuda/lib64/libcudart_static.a;Threads::Threads;dl;/usr/lib64/librt.so
-- Configuring done
-- Generating done
-- Build files have been written to: XXX/HadamardProductCuda/build
[ 30%] Built target asterics_hpc_cuda
[ 38%] Linking CXX shared library libhadamard_product_cuda.so
[ 53%] Built target hadamard_product_cuda
[ 61%] Linking CXX executable hadamard_product_gpu_cuda
[ 69%] Built target hadamard_product_gpu_cuda
Scanning dependencies of target run_hadamard_product_gpu_cuda
[ 76%] Run hadamard_product_gpu_cuda program
Hadamard product
asterics_getNbCudaDevice : Detected 1 CUDA Capable device(s)
evaluateHadamardProduct : nbElement = 1000, cyclePerElement = 1301.92 cy/el, elapsedTime = 1301917 cy, res = 0
evaluateHadamardProduct : nbElement = 2000, cyclePerElement = 269.029 cy/el, elapsedTime = 538059 cy, res = 0
evaluateHadamardProduct : nbElement = 3000, cyclePerElement = 176.55 cy/el, elapsedTime = 529650 cy, res = 0
evaluateHadamardProduct : nbElement = 5000, cyclePerElement = 113.497 cy/el, elapsedTime = 567487 cy, res = 0
evaluateHadamardProduct : nbElement = 10000, cyclePerElement = 65.2588 cy/el, elapsedTime = 652588 cy, res = 0
evaluateHadamardProduct : nbElement = 20000, cyclePerElement = 35.1019 cy/el, elapsedTime = 702038 cy, res = 0
evaluateHadamardProduct : nbElement = 50000, cyclePerElement = 17.9917 cy/el, elapsedTime = 899584 cy, res = 0
evaluateHadamardProduct : nbElement = 100000, cyclePerElement = 11.7823 cy/el, elapsedTime = 1178235 cy, res = 0
evaluateHadamardProduct : nbElement = 500000, cyclePerElement = 8.31941 cy/el, elapsedTime = 4159704 cy, res = 0
evaluateHadamardProduct : nbElement = 1000000, cyclePerElement = 6.54184 cy/el, elapsedTime = 6541838 cy, res = 0
evaluateHadamardProduct : nbElement = 10000000, cyclePerElement = 5.74387 cy/el, elapsedTime = 57438678 cy, res = 0
[ 76%] Built target run_hadamard_product_gpu_cuda
[ 84%] Linking CXX executable hadamard_product_gpu_cuda_kernel
[ 92%] Built target hadamard_product_gpu_cuda_kernel
Scanning dependencies of target run_hadamard_product_gpu_cuda_kernel
[100%] Run hadamard_product_gpu_cuda_kernel program
Hadamard product Kernel
asterics_getNbCudaDevice : Detected 1 CUDA Capable device(s)
hadamard_product_cuda_clock : nbElement = 1000, cyclePerElement = 24.702000 cy/el, elapsedTime = 24702 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 2000, cyclePerElement = 11.634500 cy/el, elapsedTime = 23269 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 3000, cyclePerElement = 8.526000 cy/el, elapsedTime = 25578 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 5000, cyclePerElement = 4.653800 cy/el, elapsedTime = 23269 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 10000, cyclePerElement = 2.544200 cy/el, elapsedTime = 25442 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 20000, cyclePerElement = 1.177200 cy/el, elapsedTime = 23544 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 50000, cyclePerElement = 0.483620 cy/el, elapsedTime = 24181 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 100000, cyclePerElement = 0.253780 cy/el, elapsedTime = 25378 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 500000, cyclePerElement = 0.110096 cy/el, elapsedTime = 55048 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 1000000, cyclePerElement = 0.085173 cy/el, elapsedTime = 85173 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 10000000, cyclePerElement = 0.057759 cy/el, elapsedTime = 577595 cy, res = 0.000000
[100%] Built target run_hadamard_product_gpu_cuda_kernel
Scanning dependencies of target run_all
[100%] Built target run_all
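
The benchmark lines in the output above all follow the same `name : key = value, ...` layout, so they can be checked mechanically. The following is a minimal Python sketch and is not part of the course material: the function name `parse_benchmark_line` and the regular expression are assumptions made for illustration. It parses such lines and verifies that the reported `cyclePerElement` equals `elapsedTime / nbElement`, up to the rounding applied when the value was printed.

```python
import re

# Pattern for lines such as:
# "evaluateHadamardProduct : nbElement = 1000, cyclePerElement = 1301.92 cy/el, elapsedTime = 1301917 cy, res = 0"
LINE_RE = re.compile(
    r"^(?P<name>\w+) : nbElement = (?P<nb>\d+), "
    r"cyclePerElement = (?P<cpe>[\d.]+) cy/el, "
    r"elapsedTime = (?P<elapsed>\d+) cy"
)

def parse_benchmark_line(line):
    """Return (name, nbElement, cyclePerElement, elapsedTime), or None if the line is not a benchmark line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return m["name"], int(m["nb"]), float(m["cpe"]), int(m["elapsed"])

# Lines copied verbatim from the job output above (one progress line, three benchmark lines).
sample = """\
[ 76%] Run hadamard_product_gpu_cuda program
evaluateHadamardProduct : nbElement = 1000, cyclePerElement = 1301.92 cy/el, elapsedTime = 1301917 cy, res = 0
hadamard_product_cuda_clock : nbElement = 1000, cyclePerElement = 24.702000 cy/el, elapsedTime = 24702 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 10000000, cyclePerElement = 0.057759 cy/el, elapsedTime = 577595 cy, res = 0.000000
"""

# Keep only the lines that match the benchmark layout.
records = [r for r in map(parse_benchmark_line, sample.splitlines()) if r]

for name, nb, cpe, elapsed in records:
    # Relative tolerance of 1% absorbs the rounding of the printed value.
    assert abs(elapsed / nb - cpe) < 1e-2 * max(cpe, 1e-3)
    print(f"{name}: {nb} elements, {elapsed / nb:.6f} cy/el")
```

Applied to a whole `.output` file, the same function would let the build and CMake lines be skipped automatically, since they do not match the pattern.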



Another test on lapp-wngpu005.in2p3.fr (targeting the right GPU):
cat hadamard_product_cuda_005.output
Used machine is Linux lapp-wngpu005.in2p3.fr 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 7.3.1
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc - works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /opt/rh/devtoolset-7/root/usr/bin/x86_64-redhat-linux-g++
-- Check for working CXX compiler: /opt/rh/devtoolset-7/root/usr/bin/x86_64-redhat-linux-g++ - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found version "11.4") 
-- Found headers CUDA : /usr/local/cuda/include
-- Found lib CUDA : /usr/local/cuda/lib64/libcudart_static.a;Threads::Threads;dl;/usr/lib64/librt.so
-- Configuring done
-- Generating done
-- Build files have been written to: /lapp_data/cta/paubert/TestCondor/COURS/HadamardProductCuda/build_lapp-wngpu005.in2p3.fr
Scanning dependencies of target asterics_hpc_cuda
[  9%] Building CXX object AstericsHPC/CMakeFiles/asterics_hpc_cuda.dir/asterics_alloc.cpp.o
[ 18%] Building CXX object AstericsHPC/CMakeFiles/asterics_hpc_cuda.dir/timer.cpp.o
[ 27%] Building CXX object AstericsHPC/CMakeFiles/asterics_hpc_cuda.dir/asterics_cuda.cpp.o
[ 36%] Linking CXX shared library libasterics_hpc_cuda.so
[ 36%] Built target asterics_hpc_cuda
[ 45%] Building NVCC (Device) object src/CMakeFiles/hadamard_product_cuda.dir/hadamard_product_cuda_generated_hadamard_product_cuda.cu.o
Scanning dependencies of target hadamard_product_cuda
[ 54%] Building C object src/CMakeFiles/hadamard_product_cuda.dir/phoenix_cuda_check.c.o
[ 63%] Linking CXX shared library libhadamard_product_cuda.so
[ 63%] Built target hadamard_product_cuda
Scanning dependencies of target hadamard_product_gpu_cuda_kernel
[ 72%] Building CXX object program/CMakeFiles/hadamard_product_gpu_cuda_kernel.dir/main_kernel.cpp.o
[ 81%] Linking CXX executable hadamard_product_gpu_cuda_kernel
[ 81%] Built target hadamard_product_gpu_cuda_kernel
Scanning dependencies of target hadamard_product_gpu_cuda
[ 90%] Building CXX object program/CMakeFiles/hadamard_product_gpu_cuda.dir/main.cpp.o
[100%] Linking CXX executable hadamard_product_gpu_cuda
[100%] Built target hadamard_product_gpu_cuda
-- Found headers CUDA : /usr/local/cuda/include
-- Found lib CUDA : /usr/local/cuda/lib64/libcudart_static.a;Threads::Threads;dl;/usr/lib64/librt.so
-- Configuring done
-- Generating done
-- Build files have been written to: /lapp_data/cta/paubert/TestCondor/COURS/HadamardProductCuda/build_lapp-wngpu005.in2p3.fr
[ 30%] Built target asterics_hpc_cuda
[ 38%] Linking CXX shared library libhadamard_product_cuda.so
[ 53%] Built target hadamard_product_cuda
[ 61%] Linking CXX executable hadamard_product_gpu_cuda
[ 69%] Built target hadamard_product_gpu_cuda
Scanning dependencies of target run_hadamard_product_gpu_cuda
[ 76%] Run hadamard_product_gpu_cuda program
Hadamard product
asterics_getNbCudaDevice : Detected 1 CUDA Capable device(s)
evaluateHadamardProduct : nbElement = 1000, cyclePerElement = 921.275 cy/el, elapsedTime = 921275 cy, res = 0
evaluateHadamardProduct : nbElement = 2000, cyclePerElement = 196.816 cy/el, elapsedTime = 393632 cy, res = 0
evaluateHadamardProduct : nbElement = 3000, cyclePerElement = 129.717 cy/el, elapsedTime = 389150 cy, res = 0
evaluateHadamardProduct : nbElement = 5000, cyclePerElement = 80.346 cy/el, elapsedTime = 401730 cy, res = 0
evaluateHadamardProduct : nbElement = 10000, cyclePerElement = 43.8766 cy/el, elapsedTime = 438766 cy, res = 0
evaluateHadamardProduct : nbElement = 20000, cyclePerElement = 25.6061 cy/el, elapsedTime = 512122 cy, res = 0
evaluateHadamardProduct : nbElement = 50000, cyclePerElement = 14.7961 cy/el, elapsedTime = 739806 cy, res = 0
evaluateHadamardProduct : nbElement = 100000, cyclePerElement = 11.1793 cy/el, elapsedTime = 1117927 cy, res = 0
evaluateHadamardProduct : nbElement = 500000, cyclePerElement = 8.35432 cy/el, elapsedTime = 4177158 cy, res = 0
evaluateHadamardProduct : nbElement = 1000000, cyclePerElement = 6.66748 cy/el, elapsedTime = 6667482 cy, res = 0
evaluateHadamardProduct : nbElement = 10000000, cyclePerElement = 5.87742 cy/el, elapsedTime = 58774161 cy, res = 0
[ 76%] Built target run_hadamard_product_gpu_cuda
[ 84%] Linking CXX executable hadamard_product_gpu_cuda_kernel
[ 92%] Built target hadamard_product_gpu_cuda_kernel
Scanning dependencies of target run_hadamard_product_gpu_cuda_kernel
[100%] Run hadamard_product_gpu_cuda_kernel program
Hadamard product Kernel
asterics_getNbCudaDevice : Detected 1 CUDA Capable device(s)
hadamard_product_cuda_clock : nbElement = 1000, cyclePerElement = 19.644000 cy/el, elapsedTime = 19644 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 2000, cyclePerElement = 9.631000 cy/el, elapsedTime = 19262 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 3000, cyclePerElement = 6.336667 cy/el, elapsedTime = 19010 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 5000, cyclePerElement = 3.742400 cy/el, elapsedTime = 18712 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 10000, cyclePerElement = 1.863900 cy/el, elapsedTime = 18639 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 20000, cyclePerElement = 0.939500 cy/el, elapsedTime = 18790 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 50000, cyclePerElement = 0.369420 cy/el, elapsedTime = 18471 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 100000, cyclePerElement = 0.190610 cy/el, elapsedTime = 19061 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 500000, cyclePerElement = 0.096732 cy/el, elapsedTime = 48366 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 1000000, cyclePerElement = 0.077998 cy/el, elapsedTime = 77998 cy, res = 0.000000
hadamard_product_cuda_clock : nbElement = 10000000, cyclePerElement = 0.056383 cy/el, elapsedTime = 563831 cy, res = 0.000000
[100%] Built target run_hadamard_product_gpu_cuda_kernel
Scanning dependencies of target run_all
[100%] Built target run_all


Note: to produce the plots, the performance test results (in build/Examples/Performances/) must be transferred to a machine that has gnuplot.


Figure 13 shows the performance of our Hadamard product on a P6000 GPU. It shows once again that it is more efficient to do as much of the computation as possible on the GPU and to limit data transfers.

Figure 13: Performance of the Hadamard product. Left: total execution time in CPU cycles. Right: execution time in cycles per element.
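
To put rough numbers on the remark about limiting data transfers, one can compare the two measurements printed by the targeted-GPU run on lapp-wngpu005 above. The sketch below is illustrative Python, not course code, and its variable names are hypothetical. It assumes that `evaluateHadamardProduct` times the whole call (presumably including host-device transfers), that `hadamard_product_cuda_clock` times the kernel only, and that the two cycle counts are comparable. The output itself confirms none of these points.

```python
# elapsedTime values (in cycles) copied from hadamard_product_cuda_005.output.
# Key: nbElement. Value: (evaluateHadamardProduct, hadamard_product_cuda_clock).
elapsed_cycles = {
    1000: (921275, 19644),
    100000: (1117927, 19061),
    10000000: (58774161, 563831),
}

ratios = {}
for nb_element, (full_call, kernel_only) in elapsed_cycles.items():
    # Ratio of the full call to the kernel alone. Assumption: both counters
    # tick at comparable rates, which the log does not state.
    ratios[nb_element] = full_call / kernel_only
    print(f"nbElement = {nb_element:>8}: full call / kernel = {ratios[nb_element]:.1f}")
```

Under these assumptions, the full call costs about 104 times more cycles than the kernel alone at 10,000,000 elements, which is consistent with the advice to keep data on the GPU between computations.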



Let us see how this behaves on an A100.



Let us test with the machine lapp-wngpu008:
condor_submit submit.condor
Submitting job(s).
1 job(s) submitted to cluster 9441.


Let us test with the machine lapp-wngpu007:
condor_submit submit.condor 
Submitting job(s).
1 job(s) submitted to cluster 9442.