GPU Programming¶
Available GPUs¶
The full hardware specifications of the GPU compute nodes may be found on the HPC Resources page. Note that the clusters may have different modules available; e.g., the available CUDA versions can be listed with
marie@compute$ module spider CUDA
Note that some modules use a specific CUDA version which is visible in the module name, e.g. GDRCopy/2.1-CUDA-11.1.1 or Horovod/0.28.1-CUDA-11.7.0-TensorFlow-2.11.0. This especially applies to the optimized CUDA libraries like cuDNN, NCCL and magma.
CUDA-aware MPI
When running CUDA applications that use MPI for interprocess communication, you need to additionally load the modules that enable CUDA-aware MPI, which may provide improved performance. Those are UCX-CUDA and UCC-CUDA, which supplement the UCX and UCC modules, respectively. Some modules, like NCCL, load those automatically.
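A minimal sketch of such a setup, assuming you add the CUDA-enabled counterparts explicitly (the exact module versions depend on the cluster; check module spider first):
marie@compute$ module load UCX-CUDA UCC-CUDA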
Using GPUs with Slurm¶
For general information on how to use Slurm, read the respective page in this compendium.
When allocating resources on a GPU node, you must specify the number of requested GPUs by using the --gres=gpu:<N> option, like this:
#!/bin/bash # Batch script starts with shebang line
#SBATCH --ntasks=1 # All #SBATCH lines have to follow uninterrupted
#SBATCH --time=01:00:00 # after the shebang line
#SBATCH --account=p_number_crunch # Comments start with # and do not count as interruptions
#SBATCH --job-name=fancyExp
#SBATCH --output=simulation-%j.out
#SBATCH --error=simulation-%j.err
#SBATCH --gres=gpu:1 # request GPU(s) from Slurm
module purge # Set up environment, e.g., clean modules environment
module load module/version module2 # and load necessary modules
srun ./application [options] # Execute parallel application with srun
Alternatively, you can work on the clusters interactively:
marie@login.<cluster_name>$ srun --nodes=1 --gres=gpu:<N> --time=00:30:00 --pty bash
marie@compute$ module purge; module switch release/<env>
Directive Based GPU Programming¶
Directives are special compiler commands in your C/C++ or Fortran source code. They tell the compiler how to parallelize and offload work to a GPU. This section explains how to use this technique.
OpenACC¶
OpenACC is a directive based GPU programming model. It currently only supports NVIDIA GPUs as a target.
Please use the following information as a start on OpenACC:
Introduction¶
OpenACC can be used with the PGI and NVIDIA HPC compilers. The NVIDIA HPC compiler, as part of the NVIDIA HPC SDK, supersedes the PGI compiler. The nvc compiler (NOT the nvcc compiler, which is used for CUDA) is available for the NVIDIA Tesla V100 and NVIDIA A100 nodes.
Using OpenACC with PGI compilers¶
- Load the latest version via module load PGI or search for available versions with module spider PGI
- For compilation, please add the compiler flag -acc to enable OpenACC interpreting by the compiler
- -Minfo tells you what the compiler is actually doing to your code
- Add -ta=nvidia:ampere to enable optimizations for the A100 GPUs (see the compile example below this list)
- You may find further information on the PGI compiler in the user guide and in the reference guide, which includes descriptions of available command line options
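As a sketch, compiling a C source file saxpy.c (an illustrative file name) with OpenACC enabled and A100 optimizations could look like this:
marie@compute$ pgcc -acc -Minfo -ta=nvidia:ampere saxpy.c -o saxpy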
Using OpenACC with NVIDIA HPC compilers¶
- Switch into the correct module environment for your selected compute nodes (see list of available GPUs)
- Load the NVHPC module for the correct module environment. Either load the default (module load NVHPC) or search for a specific version.
- Use the correct compiler for your code: nvc for C, nvc++ for C++ and nvfortran for Fortran
- Use the -acc and -Minfo flags as with the PGI compiler
- To create optimized code for either the V100 or A100, use -gpu=cc70 or -gpu=cc80, respectively (see the example after this list)
- Further information on this compiler is provided in the user guide and the reference guide, which includes descriptions of available command line options
- Information specific to the use of OpenACC with the NVIDIA HPC compiler is compiled in a guide
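The following minimal sketch shows an OpenACC-annotated saxpy loop in C and a possible compilation for the A100 with nvc; the file name, array size and constants are purely illustrative:
// saxpy.c - minimal OpenACC sketch (illustrative)
#include <stdio.h>
#define N 1000000

int main(void) {
    static float x[N], y[N];
    const float a = 2.0f;

    for (int i = 0; i < N; i++) {   // initialize input on the host
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Offload the loop: x is copied to the GPU, y is copied in and back out
    #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
    for (int i = 0; i < N; i++) {
        y[i] = a * x[i] + y[i];
    }

    printf("y[0] = %f\n", y[0]);    // expect 4.0
    return 0;
}
Compile it and let -Minfo report the generated parallelism:
marie@compute$ nvc -acc -Minfo -gpu=cc80 saxpy.c -o saxpy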
OpenMP target offloading¶
OpenMP supports target offloading as of version 4.0. A dedicated set of compiler directives can be used to annotate code sections that are intended for execution on the GPU (i.e., target offloading). Not all compilers with OpenMP support offer target offloading; refer to the official list for details. Furthermore, some compilers, such as GCC, have basic support for target offloading, but do not enable these features by default and/or achieve poor performance.
On the ZIH system, compilers with OpenMP target offloading support are provided on the clusters power9 and alpha. Two compilers with good performance can be used: the NVIDIA HPC compiler and the IBM XL compiler.
Using OpenMP target offloading with NVIDIA HPC compilers¶
- Load the module environments and the NVIDIA HPC SDK as described in the OpenACC section
- Use the -mp=gpu flag to enable OpenMP with offloading (an example follows this list)
- -Minfo tells you what the compiler is actually doing to your code
- The same compiler options as mentioned above are available for OpenMP, including the -gpu=ccXY flag
- OpenMP-specific advice may be found in the respective section in the user guide
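As a sketch, an OpenMP target offloading version of the saxpy loop (file name illustrative) could look like this and be compiled with nvc:
// saxpy_omp.c - minimal OpenMP target offloading sketch (illustrative)
#include <stdio.h>
#define N 1000000

int main(void) {
    static float x[N], y[N];
    const float a = 2.0f;

    for (int i = 0; i < N; i++) {   // initialize input on the host
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Map x to the device, map y there and back, and distribute the loop over GPU threads
    #pragma omp target teams distribute parallel for map(to: x[0:N]) map(tofrom: y[0:N])
    for (int i = 0; i < N; i++) {
        y[i] = a * x[i] + y[i];
    }

    printf("y[0] = %f\n", y[0]);    // expect 4.0
    return 0;
}
marie@compute$ nvc -mp=gpu -Minfo -gpu=cc80 saxpy_omp.c -o saxpy_omp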
Using OpenMP target offloading with the IBM XL compilers¶
The IBM XL compilers (xlc for C, xlc++ for C++ and xlf for Fortran, with sub-versions for different versions of Fortran) are only available on the cluster power9 with NVIDIA Tesla V100 GPUs.
- The -qsmp -qoffload combination of flags enables OpenMP target offloading support
- Optimizations specific to the V100 GPUs can be enabled by using the -qtgtarch=sm_70 flag (see the compile example below this list)
- IBM provides an XL compiler documentation with a list of supported OpenMP directives and information on target-offloading specifics
Native GPU Programming¶
CUDA¶
Native CUDA programs can sometimes offer a better performance. NVIDIA provides some introductory material and links. An introduction to CUDA is provided as well. The toolkit documentation page links to the programming guide and the best practice guide. Optimization guides for supported NVIDIA architectures are available, including for Volta (V100) and Ampere (A100).
In order to compile an application with CUDA, use the nvcc compiler command, which is described in detail in the nvcc documentation. This compiler is available via several CUDA packages; a default version can be loaded via module load CUDA. Additionally, the NVHPC modules provide CUDA tools as well.
For using CUDA with Open MPI across multiple nodes, the loaded OpenMPI module must have been compiled with CUDA support. If you are not sure whether the module you are using has support for it, you can check it as follows:
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value | awk -F":" '{print "Open MPI supports CUDA:",$7}'
Usage of the CUDA Compiler¶
The simple invocation nvcc <code.cu> will compile a valid CUDA program. nvcc differentiates between the device and the host code, which will be compiled in separate phases. Therefore, compiler options can be defined specifically for the device as well as for the host code. By default, GCC is used as the host compiler. The following flags may be useful:
- --generate-code (-gencode): generate optimized code for a target GPU (caution: these binaries cannot be used with GPUs of other generations); see the complete example below this list
  - For Volta (V100): --generate-code arch=compute_70,code=sm_70
  - For Ampere (A100): --generate-code arch=compute_80,code=sm_80
- -Xcompiler: pass flags to the host compiler. E.g., generate OpenMP-parallel host code: -Xcompiler -fopenmp. The -Xcompiler flag has to be invoked for each host flag
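As a minimal sketch, the following CUDA program launches a saxpy kernel; the file name, kernel name and sizes are illustrative. It could be compiled for the A100 as shown after the listing.
// saxpy.cu - minimal CUDA sketch (illustrative)
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

int main(void) {
    const int n = 1 << 20;
    float *x, *y;

    // Unified memory keeps the sketch short; explicit cudaMalloc/cudaMemcpy works as well
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);  // one thread per element
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);                     // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
marie@compute$ nvcc --generate-code arch=compute_80,code=sm_80 saxpy.cu -o saxpy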
Performance Analysis¶
Consult NVIDIA's Best Practices Guide and the performance guidelines for possible steps to take for performance analysis and optimization.
Multiple tools can be used for the performance analysis. For the analysis of applications on the newer GPUs (V100 and A100), we recommend the newer NVIDIA Nsight tools: Nsight Systems for system-wide sampling and tracing, and Nsight Compute for a detailed analysis of individual kernels.
NVIDIA nvprof & Visual Profiler¶
The nvprof command line tool and the Visual Profiler are available once a CUDA module has been loaded.
For a simple analysis, you can call nvprof without any options, like this:
marie@compute$ nvprof ./application [options]
For a more in-depth analysis, we recommend you use the command line tool first to generate a report file, which you can later analyze in the Visual Profiler. In order to collect a set of general metrics for the analysis in the Visual Profiler, use the --analysis-metrics flag to collect metrics and --export-profile to generate a report file, like this:
marie@compute$ nvprof --analysis-metrics --export-profile <output>.nvvp ./application [options]
Transfer the report file to your local system and analyze it in the Visual Profiler (nvvp) locally. This will give the smoothest user experience. Alternatively, you can use X11-forwarding. Refer to the documentation for details about the individual features and views of the Visual Profiler.
Besides these generic analysis methods, you can profile specific aspects of your GPU kernels.
nvprof can profile specific events. For this, use
marie@compute$ nvprof --query-events
to get a list of available events. Analyze one or more events by specifying them, separated by commas:
marie@compute$ nvprof --events <event_1>[,<event_2>[,...]] ./application [options]
Additionally, you can analyze specific metrics. Similar to the profiling of events, you can get a list of available metrics:
marie@compute$ nvprof --query-metrics
One or more metrics can be profiled at the same time:
marie@compute$ nvprof --metrics <metric_1>[,<metric_2>[,...]] ./application [options]
If you want to limit the profiler's scope to one or more kernels, you can use the --kernels <kernel_1>[,<kernel_2>] flag. For further command line options, refer to the documentation on command line options.
NVIDIA Nsight Systems¶
Use NVIDIA Nsight Systems for a system-wide sampling of your code. Refer to the NVIDIA Nsight Systems User Guide for details. With this, you can identify parts of your code that take a long time to run and are suitable optimization candidates.
Use the command-line version to sample your code and create a report file for later analysis:
marie@compute$ nsys profile [--stats=true] ./application [options]
The --stats=true flag is optional and will create a summary on the command line. Depending on your needs, this analysis may be sufficient to identify optimization targets.
The graphical user interface version can be used for a thorough analysis of your previously generated report file. For an optimal user experience, we recommend a local installation of NVIDIA Nsight Systems. In this case, you can transfer the report file to your local system. Alternatively, you can use X11-forwarding. The graphical user interface is usually available as nsys-ui.
Furthermore, you can use the command line interface for further analyses. Refer to the documentation for a list of available command line options.
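For example, a summary comparable to --stats=true can be generated later from an existing report file (the report name is a placeholder):
marie@compute$ nsys stats <report>.nsys-rep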
NVIDIA Nsight Compute¶
Nsight Compute is used for the analysis of individual GPU kernels. It supports GPUs from the Volta architecture onward (on the ZIH system: V100 and A100). If you are familiar with nvprof, you may want to consult the Nvprof Transition Guide, as Nsight Compute uses a new scheme for metrics. We recommend selecting those kernels as optimization targets that account for a large portion of your runtime, according to Nsight Systems. Nsight Compute is particularly useful for CUDA code, as you have much greater control over your code compared to the directive based approaches.
Nsight Compute comes in a command line and a graphical version. Refer to the Kernel Profiling Guide to get an overview of the functionality of these tools.
You can call the command line version (ncu) without further options to get a broad overview of your kernel's performance:
marie@compute$ ncu ./application [options]
As with the other profiling tools, the Nsight Compute profiler can generate report files like this:
marie@compute$ ncu --export <report> ./application [options]
The report file will automatically get the file ending .ncu-rep; you do not need to specify this manually.
This report file can be analyzed in the graphical user interface profiler. Again, we recommend you generate a report file on a compute node and transfer the report file to your local system. Alternatively, you can use X11-forwarding. The graphical user interface is usually available as ncu-ui or nv-nsight-cu.
Similar to the nvprof profiler, you can analyze specific metrics. NVIDIA provides a Metrics Guide. Use --query-metrics to get a list of available metrics, listing them by base name. Individual metrics can be collected by using
marie@compute$ ncu --metrics <metric_1>[,<metric_2>,...] ./application [options]
Collection of events is no longer possible with Nsight Compute. Instead, many nvprof events can be measured with metrics.
You can collect metrics for individual kernels by specifying the --kernel-name flag, for example as sketched below.
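A sketch of such an invocation, where the kernel name and metric are placeholders:
marie@compute$ ncu --kernel-name <kernel> --metrics <metric_1>[,<metric_2>,...] ./application [options]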