3.4.7.1.3: Retrieving the compute capabilities with a job

First, we need the project developed in section 3.2.

We add a build script to it, install.sh:

#!/bin/bash

echo "Used machine is $(uname -a)"

if [ -d build ]
then
    echo "Remove existing directory build"
    rm -fr build
fi

mkdir -p build
cd build

cmake3 ..
make
./test_cuda_capabilities

Then our Condor configuration, submit.condor:

# Name of the executable
executable=install.sh
# Use the vanilla universe, the standard one for ordinary jobs
universe=vanilla
# Standard output file
output=install.output
# Standard error file
error=install.error
# Log file
log=install.log
# Pass the submission environment on to the job
getenv = True

# We only want one GPU
request_gpus = 1
# For a specific GPU server, uncomment the next line and replace 007 with 001 to 009 according to your needs
# requirements = machine == "lapp-wngpu007.in2p3.fr"

# For a specific GPU type, uncomment the next line and replace XXX with k80, v100, p6000, t4 or a100
# +wantGpuType = "XXX"

# Queue a single job
queue

Let's submit our job:

condor_submit submit.condor
Submitting job(s).
1 job(s) submitted to cluster 9421.

Let's look at the result:

cat install.output
Used machine is Linux lapp-wngpu005.in2p3.fr 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 15:12:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Remove existing directory build
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 4.8.5
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc - works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found version "11.4") 
-- Found headers CUDA : /usr/local/cuda/include
-- Found lib CUDA : /usr/local/cuda/lib64/libcudart_static.a;Threads::Threads;dl;/usr/lib64/librt.so
-- Configuring done
-- Generating done
-- Build files have been written to: XXX/TestCudaCapabilities/build
Scanning dependencies of target test_cuda_capabilities
[ 50%] Building CXX object CMakeFiles/test_cuda_capabilities.dir/main.cpp.o
[100%] Linking CXX executable test_cuda_capabilities
[100%] Built target test_cuda_capabilities
Detected 1 CUDA Capable device(s)

Device 0: "Quadro P6000"
	CUDA Driver Version / Runtime Version          11.4 / 11.4
	CUDA Capability Major/Minor version number:    6.1
	Total amount of global memory:                 24450 MBytes (25637224448 bytes)
	MapSMtoCores for SM 6.1 is undefined.  Default to use 128 Cores/SM
	(30) Multiprocessors, (128) CUDA Cores/MP:     3840 CUDA Cores
	GPU Clock rate:                                1645 MHz (1.64 GHz)
	Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
	Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
	Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
	Total amount of constant memory:               65536 bytes
	Total amount of shared memory per block:       49152 bytes
	Total number of registers available per block: 65536
	Warp size:                                     32
	Maximum number of threads per multiprocessor:  2048
	Maximum number of threads per block:           1024
	Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
	Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
	Maximum memory pitch:                          2147483647 bytes
	Texture alignment:                             512 bytes
	Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
	Run time limit on kernels:                     No
	Integrated GPU sharing Host Memory:            No
	Support host page-locked memory mapping:       Yes
	Alignment requirement for Surfaces:            Yes
	Device has ECC support:                        Disabled

It turns out that the machine lapp-wngpu005 does not have an A100 but a Quadro P6000, which is still a powerful card (in any case much more powerful than my Quadro M2200).

As with the Quadro M2200, a block holds at most 1024 threads, so the largest square two-dimensional block is $32 \times 32 = 1024$ threads.

This again gives us $60 \times 36 = 2160$ blocks for the computation.
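
The block count follows from a ceiling division of the problem size by the block size. Here is a minimal bash sketch of that arithmetic. The 1920 x 1152 problem size is an assumption: the dimensions are not stated in this section, and they are chosen only because they reproduce the 60 x 36 grid. The helper name blocks_needed is hypothetical as well.

```shell
#!/bin/bash

# Number of blocks needed to cover `size` elements with blocks of `block`
# threads along one axis (ceiling division, so a partial last block counts).
blocks_needed() {
    local size=$1
    local block=$2
    echo $(( (size + block - 1) / block ))
}

# Hypothetical problem size, chosen to reproduce the 60 x 36 grid of the text
width=1920
height=1152
# Block edge: 32 x 32 = 1024 threads, the reported maximum per block
block_edge=32

grid_x=$(blocks_needed "$width" "$block_edge")
grid_y=$(blocks_needed "$height" "$block_edge")

echo "Threads per block: $(( block_edge * block_edge ))"
echo "Grid: ${grid_x} x ${grid_y} = $(( grid_x * grid_y )) blocks"
```

With a size that is not a multiple of 32, the last row or column of blocks is only partly used, so the kernel has to check its indices against the real bounds.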

With an A100

For lapp-wngpu007:

condor_submit submit.condor 
Submitting job(s).
1 job(s) submitted to cluster 9439.

For lapp-wngpu008:

condor_submit submit.condor 
Submitting job(s).
1 job(s) submitted to cluster 9440.
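
Both submissions use the same submit.condor name, so the file presumably has to be edited in between to enable the requirements line for the wanted server. One possible way to avoid editing by hand is to generate the submit description per machine, as sketched below in bash. The function name make_submit_file is hypothetical, and the per-machine output, error and log names are an assumption modelled on the test_cuda_capabilities_008.output file used in the rest of this section.

```shell
#!/bin/bash

# Print a submit description that pins the job to one GPU server.
# $1 is the three-digit server number (for example 007).
# make_submit_file is a hypothetical helper, not an HTCondor command.
make_submit_file() {
    local id=$1
    cat <<EOF
executable=install.sh
universe=vanilla
output=test_cuda_capabilities_${id}.output
error=test_cuda_capabilities_${id}.error
log=test_cuda_capabilities_${id}.log
getenv = True
request_gpus = 1
requirements = machine == "lapp-wngpu${id}.in2p3.fr"
queue
EOF
}

# Show the description generated for lapp-wngpu007
make_submit_file 007
```

Redirecting the output to a file, for example `make_submit_file 008 > submit_008.condor`, gives a file that can then be passed to condor_submit.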

Here is what we get on the machine lapp-wngpu008:

cat test_cuda_capabilities_008.output
Used machine is Linux lapp-wngpu008.in2p3.fr 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Remove existing directory build
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 7.3.1
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc - works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /opt/rh/devtoolset-7/root/usr/bin/x86_64-redhat-linux-g++
-- Check for working CXX compiler: /opt/rh/devtoolset-7/root/usr/bin/x86_64-redhat-linux-g++ - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found version "11.4") 
-- Found headers CUDA : /usr/local/cuda/include
-- Found lib CUDA : /usr/local/cuda/lib64/libcudart_static.a;Threads::Threads;dl;/usr/lib64/librt.so
-- Configuring done
-- Generating done
-- Build files have been written to: XXX/TestCudaCapabilities/build
Scanning dependencies of target test_cuda_capabilities
[ 50%] Building CXX object CMakeFiles/test_cuda_capabilities.dir/main.cpp.o
[100%] Linking CXX executable test_cuda_capabilities
[100%] Built target test_cuda_capabilities
Detected 3 CUDA Capable device(s)


Device 0: "NVIDIA A100-PCIE-40GB"
	CUDA Driver Version / Runtime Version          11.4 / 11.4
	CUDA Capability Major/Minor version number:    8.0
	Total amount of global memory:                 40536 MBytes (42505273344 bytes)
	MapSMtoCores for SM 8.0 is undefined.  Default to use 128 Cores/SM
	(108) Multiprocessors, (128) CUDA Cores/MP:     13824 CUDA Cores
	GPU Clock rate:                                1410 MHz (1.41 GHz)
	Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
	Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
	Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
	Total amount of constant memory:               65536 bytes
	Total amount of shared memory per block:       49152 bytes
	Total number of registers available per block: 65536
	Warp size:                                     32
	Maximum number of threads per multiprocessor:  2048
	Maximum number of threads per block:           1024
	Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
	Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
	Maximum memory pitch:                          2147483647 bytes
	Texture alignment:                             512 bytes
	Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
	Run time limit on kernels:                     No
	Integrated GPU sharing Host Memory:            No
	Support host page-locked memory mapping:       Yes
	Alignment requirement for Surfaces:            Yes
	Device has ECC support:                        Enabled


Device 1: "NVIDIA A100-PCIE-40GB"
	CUDA Driver Version / Runtime Version          11.4 / 11.4
	CUDA Capability Major/Minor version number:    8.0
	Total amount of global memory:                 40536 MBytes (42505273344 bytes)
	MapSMtoCores for SM 8.0 is undefined.  Default to use 128 Cores/SM
	(108) Multiprocessors, (128) CUDA Cores/MP:     13824 CUDA Cores
	GPU Clock rate:                                1410 MHz (1.41 GHz)
	Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
	Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
	Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
	Total amount of constant memory:               65536 bytes
	Total amount of shared memory per block:       49152 bytes
	Total number of registers available per block: 65536
	Warp size:                                     32
	Maximum number of threads per multiprocessor:  2048
	Maximum number of threads per block:           1024
	Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
	Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
	Maximum memory pitch:                          2147483647 bytes
	Texture alignment:                             512 bytes
	Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
	Run time limit on kernels:                     No
	Integrated GPU sharing Host Memory:            No
	Support host page-locked memory mapping:       Yes
	Alignment requirement for Surfaces:            Yes
	Device has ECC support:                        Enabled


Device 2: "NVIDIA A100-PCIE-40GB"
	CUDA Driver Version / Runtime Version          11.4 / 11.4
	CUDA Capability Major/Minor version number:    8.0
	Total amount of global memory:                 40536 MBytes (42505273344 bytes)
	MapSMtoCores for SM 8.0 is undefined.  Default to use 128 Cores/SM
	(108) Multiprocessors, (128) CUDA Cores/MP:     13824 CUDA Cores
	GPU Clock rate:                                1410 MHz (1.41 GHz)
	Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
	Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
	Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
	Total amount of constant memory:               65536 bytes
	Total amount of shared memory per block:       49152 bytes
	Total number of registers available per block: 65536
	Warp size:                                     32
	Maximum number of threads per multiprocessor:  2048
	Maximum number of threads per block:           1024
	Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
	Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
	Maximum memory pitch:                          2147483647 bytes
	Texture alignment:                             512 bytes
	Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
	Run time limit on kernels:                     No
	Integrated GPU sharing Host Memory:            No
	Support host page-locked memory mapping:       Yes
	Alignment requirement for Surfaces:            Yes
	Device has ECC support:                        Enabled

Here is what we get on the machine lapp-wngpu007:

cat test_cuda_capabilities_007.output
Used machine is Linux lapp-wngpu007.in2p3.fr 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Remove existing directory build
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 7.3.1
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc - works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /opt/rh/devtoolset-7/root/usr/bin/x86_64-redhat-linux-g++
-- Check for working CXX compiler: /opt/rh/devtoolset-7/root/usr/bin/x86_64-redhat-linux-g++ - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found version "11.4") 
-- Found headers CUDA : /usr/local/cuda/include
-- Found lib CUDA : /usr/local/cuda/lib64/libcudart_static.a;Threads::Threads;dl;/usr/lib64/librt.so
-- Configuring done
-- Generating done
-- Build files have been written to: XXX/TestCudaCapabilities/build
Scanning dependencies of target test_cuda_capabilities
[ 50%] Building CXX object CMakeFiles/test_cuda_capabilities.dir/main.cpp.o
[100%] Linking CXX executable test_cuda_capabilities
[100%] Built target test_cuda_capabilities
Detected 1 CUDA Capable device(s)


Device 0: "NVIDIA A100-PCIE-40GB MIG 3g.20gb"
	CUDA Driver Version / Runtime Version          11.4 / 11.4
	CUDA Capability Major/Minor version number:    8.0
	Total amount of global memory:                 20096 MBytes (21072183296 bytes)
	MapSMtoCores for SM 8.0 is undefined.  Default to use 128 Cores/SM
	(42) Multiprocessors, (128) CUDA Cores/MP:     5376 CUDA Cores
	GPU Clock rate:                                1410 MHz (1.41 GHz)
	Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
	Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
	Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
	Total amount of constant memory:               65536 bytes
	Total amount of shared memory per block:       49152 bytes
	Total number of registers available per block: 65536
	Warp size:                                     32
	Maximum number of threads per multiprocessor:  2048
	Maximum number of threads per block:           1024
	Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
	Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
	Maximum memory pitch:                          2147483647 bytes
	Texture alignment:                             512 bytes
	Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
	Run time limit on kernels:                     No
	Integrated GPU sharing Host Memory:            No
	Support host page-locked memory mapping:       Yes
	Alignment requirement for Surfaces:            Yes
	Device has ECC support:                        Enabled

Here is what we get on the machine lapp-wngpu003:

cat test_cuda_capabilities_003.output
Used machine is Linux lapp-wngpu003.in2p3.fr 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Remove existing directory build
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 7.3.1
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc - works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /opt/rh/devtoolset-7/root/usr/bin/x86_64-redhat-linux-g++
-- Check for working CXX compiler: /opt/rh/devtoolset-7/root/usr/bin/x86_64-redhat-linux-g++ - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found version "11.4") 
-- Found headers CUDA : /usr/local/cuda/include
-- Found lib CUDA : /usr/local/cuda/lib64/libcudart_static.a;Threads::Threads;dl;/usr/lib64/librt.so
-- Configuring done
-- Generating done
-- Build files have been written to: XXX/TestCudaCapabilities/build
Scanning dependencies of target test_cuda_capabilities
[ 50%] Building CXX object CMakeFiles/test_cuda_capabilities.dir/main.cpp.o
[100%] Linking CXX executable test_cuda_capabilities
[100%] Built target test_cuda_capabilities
Detected 2 CUDA Capable device(s)


Device 0: "Tesla K80"
	CUDA Driver Version / Runtime Version          11.4 / 11.4
	CUDA Capability Major/Minor version number:    3.7
	Total amount of global memory:                 11441 MBytes (11997020160 bytes)
	(13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
	GPU Clock rate:                                824 MHz (0.82 GHz)
	Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
	Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
	Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
	Total amount of constant memory:               65536 bytes
	Total amount of shared memory per block:       49152 bytes
	Total number of registers available per block: 65536
	Warp size:                                     32
	Maximum number of threads per multiprocessor:  2048
	Maximum number of threads per block:           1024
	Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
	Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
	Maximum memory pitch:                          2147483647 bytes
	Texture alignment:                             512 bytes
	Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
	Run time limit on kernels:                     No
	Integrated GPU sharing Host Memory:            No
	Support host page-locked memory mapping:       Yes
	Alignment requirement for Surfaces:            Yes
	Device has ECC support:                        Enabled


Device 1: "Tesla K80"
	CUDA Driver Version / Runtime Version          11.4 / 11.4
	CUDA Capability Major/Minor version number:    3.7
	Total amount of global memory:                 11441 MBytes (11997020160 bytes)
	(13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
	GPU Clock rate:                                824 MHz (0.82 GHz)
	Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
	Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
	Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
	Total amount of constant memory:               65536 bytes
	Total amount of shared memory per block:       49152 bytes
	Total number of registers available per block: 65536
	Warp size:                                     32
	Maximum number of threads per multiprocessor:  2048
	Maximum number of threads per block:           1024
	Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
	Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
	Maximum memory pitch:                          2147483647 bytes
	Texture alignment:                             512 bytes
	Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
	Run time limit on kernels:                     No
	Integrated GPU sharing Host Memory:            No
	Support host page-locked memory mapping:       Yes
	Alignment requirement for Surfaces:            Yes
	Device has ECC support:                        Enabled

Here is what we get on the machine lapp-wngpu006:

cat test_cuda_capabilities_006.output
Used machine is Linux lapp-wngpu006.in2p3.fr 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Remove existing directory build
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 7.3.1
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc - works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /opt/rh/devtoolset-7/root/usr/bin/x86_64-redhat-linux-g++
-- Check for working CXX compiler: /opt/rh/devtoolset-7/root/usr/bin/x86_64-redhat-linux-g++ - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found version "11.4") 
-- Found headers CUDA : /usr/local/cuda/include
-- Found lib CUDA : /usr/local/cuda/lib64/libcudart_static.a;Threads::Threads;dl;/usr/lib64/librt.so
-- Configuring done
-- Generating done
-- Build files have been written to: XXX/TestCudaCapabilities/build
Scanning dependencies of target test_cuda_capabilities
[ 50%] Building CXX object CMakeFiles/test_cuda_capabilities.dir/main.cpp.o
[100%] Linking CXX executable test_cuda_capabilities
[100%] Built target test_cuda_capabilities
Detected 4 CUDA Capable device(s)


Device 0: "Tesla T4"
	CUDA Driver Version / Runtime Version          11.4 / 11.4
	CUDA Capability Major/Minor version number:    7.5
	Total amount of global memory:                 15110 MBytes (15843721216 bytes)
	MapSMtoCores for SM 7.5 is undefined.  Default to use 128 Cores/SM
	(40) Multiprocessors, (128) CUDA Cores/MP:     5120 CUDA Cores
	GPU Clock rate:                                1590 MHz (1.59 GHz)
	Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
	Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
	Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
	Total amount of constant memory:               65536 bytes
	Total amount of shared memory per block:       49152 bytes
	Total number of registers available per block: 65536
	Warp size:                                     32
	Maximum number of threads per multiprocessor:  1024
	Maximum number of threads per block:           1024
	Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
	Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
	Maximum memory pitch:                          2147483647 bytes
	Texture alignment:                             512 bytes
	Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
	Run time limit on kernels:                     No
	Integrated GPU sharing Host Memory:            No
	Support host page-locked memory mapping:       Yes
	Alignment requirement for Surfaces:            Yes
	Device has ECC support:                        Enabled


Device 1: "Tesla T4"
	CUDA Driver Version / Runtime Version          11.4 / 11.4
	CUDA Capability Major/Minor version number:    7.5
	Total amount of global memory:                 15110 MBytes (15843721216 bytes)
	MapSMtoCores for SM 7.5 is undefined.  Default to use 128 Cores/SM
	(40) Multiprocessors, (128) CUDA Cores/MP:     5120 CUDA Cores
	GPU Clock rate:                                1590 MHz (1.59 GHz)
	Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
	Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
	Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
	Total amount of constant memory:               65536 bytes
	Total amount of shared memory per block:       49152 bytes
	Total number of registers available per block: 65536
	Warp size:                                     32
	Maximum number of threads per multiprocessor:  1024
	Maximum number of threads per block:           1024
	Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
	Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
	Maximum memory pitch:                          2147483647 bytes
	Texture alignment:                             512 bytes
	Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
	Run time limit on kernels:                     No
	Integrated GPU sharing Host Memory:            No
	Support host page-locked memory mapping:       Yes
	Alignment requirement for Surfaces:            Yes
	Device has ECC support:                        Enabled


Device 2: "Tesla T4"
	CUDA Driver Version / Runtime Version          11.4 / 11.4
	CUDA Capability Major/Minor version number:    7.5
	Total amount of global memory:                 15110 MBytes (15843721216 bytes)
	MapSMtoCores for SM 7.5 is undefined.  Default to use 128 Cores/SM
	(40) Multiprocessors, (128) CUDA Cores/MP:     5120 CUDA Cores
	GPU Clock rate:                                1590 MHz (1.59 GHz)
	Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
	Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
	Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
	Total amount of constant memory:               65536 bytes
	Total amount of shared memory per block:       49152 bytes
	Total number of registers available per block: 65536
	Warp size:                                     32
	Maximum number of threads per multiprocessor:  1024
	Maximum number of threads per block:           1024
	Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
	Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
	Maximum memory pitch:                          2147483647 bytes
	Texture alignment:                             512 bytes
	Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
	Run time limit on kernels:                     No
	Integrated GPU sharing Host Memory:            No
	Support host page-locked memory mapping:       Yes
	Alignment requirement for Surfaces:            Yes
	Device has ECC support:                        Enabled


Device 3: "Tesla T4"
	CUDA Driver Version / Runtime Version          11.4 / 11.4
	CUDA Capability Major/Minor version number:    7.5
	Total amount of global memory:                 15110 MBytes (15843721216 bytes)
	MapSMtoCores for SM 7.5 is undefined.  Default to use 128 Cores/SM
	(40) Multiprocessors, (128) CUDA Cores/MP:     5120 CUDA Cores
	GPU Clock rate:                                1590 MHz (1.59 GHz)
	Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
	Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
	Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
	Total amount of constant memory:               65536 bytes
	Total amount of shared memory per block:       49152 bytes
	Total number of registers available per block: 65536
	Warp size:                                     32
	Maximum number of threads per multiprocessor:  1024
	Maximum number of threads per block:           1024
	Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
	Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
	Maximum memory pitch:                          2147483647 bytes
	Texture alignment:                             512 bytes
	Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
	Run time limit on kernels:                     No
	Integrated GPU sharing Host Memory:            No
	Support host page-locked memory mapping:       Yes
	Alignment requirement for Surfaces:            Yes
	Device has ECC support:                        Enabled

Here is what we get on the machine lapp-wngpu004:

cat test_cuda_capabilities_004.output
Used machine is Linux lapp-wngpu004.in2p3.fr 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Remove existing directory build
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 7.3.1
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc - works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /opt/rh/devtoolset-7/root/usr/bin/x86_64-redhat-linux-g++
-- Check for working CXX compiler: /opt/rh/devtoolset-7/root/usr/bin/x86_64-redhat-linux-g++ - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found version "11.4") 
-- Found headers CUDA : /usr/local/cuda/include
-- Found lib CUDA : /usr/local/cuda/lib64/libcudart_static.a;Threads::Threads;dl;/usr/lib64/librt.so
-- Configuring done
-- Generating done
-- Build files have been written to: XXX/TestCudaCapabilities/build
Scanning dependencies of target test_cuda_capabilities
[ 50%] Building CXX object CMakeFiles/test_cuda_capabilities.dir/main.cpp.o
[100%] Linking CXX executable test_cuda_capabilities
[100%] Built target test_cuda_capabilities
Detected 1 CUDA Capable device(s)


Device 0: "Tesla V100-PCIE-16GB"
	CUDA Driver Version / Runtime Version          11.4 / 11.4
	CUDA Capability Major/Minor version number:    7.0
	Total amount of global memory:                 16160 MBytes (16945512448 bytes)
	MapSMtoCores for SM 7.0 is undefined.  Default to use 128 Cores/SM
	(80) Multiprocessors, (128) CUDA Cores/MP:     10240 CUDA Cores
	GPU Clock rate:                                1380 MHz (1.38 GHz)
	Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
	Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
	Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
	Total amount of constant memory:               65536 bytes
	Total amount of shared memory per block:       49152 bytes
	Total number of registers available per block: 65536
	Warp size:                                     32
	Maximum number of threads per multiprocessor:  2048
	Maximum number of threads per block:           1024
	Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
	Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
	Maximum memory pitch:                          2147483647 bytes
	Texture alignment:                             512 bytes
	Concurrent copy and kernel execution:          Yes with 7 copy engine(s)
	Run time limit on kernels:                     No
	Integrated GPU sharing Host Memory:            No
	Support host page-locked memory mapping:       Yes
	Alignment requirement for Surfaces:            Yes
	Device has ECC support:                        Enabled
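
A remark on the "MapSMtoCores for SM x.y is undefined. Default to use 128 Cores/SM" lines in the outputs above: the core counts printed for compute capabilities 7.0, 7.5 and 8.0 rely on that fallback value. NVIDIA's published specifications give 64 FP32 cores per multiprocessor for these architectures, so those totals are overestimated by a factor of two (the fallback happens to be right for 6.1). The bash sketch below recomputes the totals from the multiprocessor counts reported above. The function names are hypothetical, and the table only covers the capabilities seen in this section.

```shell
#!/bin/bash

# FP32 cores per multiprocessor for the compute capabilities seen above
# (values from NVIDIA's architecture specifications).
cores_per_sm() {
    case "$1" in
        3.7) echo 192 ;;   # Kepler (Tesla K80)
        6.1) echo 128 ;;   # Pascal (Quadro P6000)
        7.0) echo 64 ;;    # Volta (Tesla V100)
        7.5) echo 64 ;;    # Turing (Tesla T4)
        8.0) echo 64 ;;    # Ampere (A100)
        *)   echo "unknown compute capability: $1" >&2; return 1 ;;
    esac
}

# Total cores = multiprocessors x cores per multiprocessor
total_cores() {
    local capability=$1
    local multiprocessors=$2
    local per_sm
    per_sm=$(cores_per_sm "$capability") || return 1
    echo $(( multiprocessors * per_sm ))
}

# Multiprocessor counts taken from the job outputs above
echo "Quadro P6000       : $(total_cores 6.1 30) cores"
echo "Tesla K80          : $(total_cores 3.7 13) cores"
echo "Tesla V100         : $(total_cores 7.0 80) cores"
echo "Tesla T4           : $(total_cores 7.5 40) cores"
echo "A100               : $(total_cores 8.0 108) cores"
echo "A100 MIG 3g.20gb   : $(total_cores 8.0 42) cores"
```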