
GPU Cluster Capella

Acceptance phase

The cluster Capella is currently in the acceptance phase, i.e., interruptions, reboots without notice, and node failures are possible. Furthermore, the system's configuration might be adjusted further.

Do not yet move your "production" workloads to Capella, but feel free to test it using moderately sized workloads. Please read this page carefully to understand what you need to adapt in your existing workflows w.r.t. filesystems, software and modules, and batch jobs.

We highly appreciate your hints and would be pleased to receive your comments and experiences regarding its operation via e-mail to hpc-support@tu-dresden.de using the subject Capella: .

Please understand that our current priority is the acceptance, configuration, and rollout of the system. Consequently, we are unable to address any support requests at this time.

Overview

The multi-GPU cluster Capella has been installed for AI-related computations and traditional HPC simulations. Capella is fully integrated into the ZIH HPC infrastructure; therefore, its usage should be similar to that of the other clusters.

Hardware Specifications

The hardware specification is documented on the page HPC Resources.

Access and Login Nodes

You use login[1-2].capella.hpc.tu-dresden.de to access the cluster Capella from the campus (or VPN) network. In order to verify the SSH fingerprints of the login nodes, please refer to the page Key Fingerprints.
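For example, a login from your local machine via SSH could look like this (the username marie is a placeholder for your ZIH login):

    # Connect to one of the two login nodes of Capella
    ssh marie@login1.capella.hpc.tu-dresden.de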

On the login nodes, you have access to the same filesystems and software stack as on the compute nodes. GPUs are not available there.

In the subsections Filesystems and Software and Modules we provide further information on these two topics.

Filesystems

As with all other clusters, your /home directory is also available on Capella. For convenience, the filesystems horse and walrus are accessible as well. Please note that the filesystem horse is not to be used as the working filesystem on the cluster Capella.

With Capella comes the new filesystem cat, designed to meet the high I/O requirements of AI and ML workflows. It is a WEKAio filesystem mounted under /data/cat. It is only available on the cluster Capella and the Datamover nodes.

Main working filesystem is cat

The filesystem cat should be used as the main working filesystem and has to be used with workspaces. Workspaces on the filesystem cat can only be created on Capella's login and compute nodes, not on the other clusters, since cat is not available there (see the example below).

Nevertheless, all other filesystems (/home, /software, /data/horse, /data/walrus, etc.) are also available.
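A minimal sketch of allocating a workspace on cat from a Capella login node (the workspace name and duration are examples; see ws_allocate --help for all options):

    # Allocate a workspace named "ml-run" on the cat filesystem for 30 days
    ws_allocate -F cat ml-run 30
    # List your existing workspaces on cat
    ws_list -F cat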

Data transfer to and from /data/cat

Please use the new filesystem cat as the working filesystem on Capella. It has limited capacity, so we advise you to only hold hot data on cat. To transfer input and result data from and to the filesystems horse and walrus, respectively, you will need to use the Datamover nodes. Regardless of the direction of transfer, you should pack your data into archives (e.g., using the dttar command) for the transfer.
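A sketch of staging result data from cat to walrus via the Datamover's dttar command (the workspace paths are placeholders and assume existing workspaces on both filesystems):

    # Pack results from a cat workspace into a tar archive on walrus
    dttar -cvf /data/walrus/ws/marie-archive/results.tar /data/cat/ws/marie-run/output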

Do not invoke data transfers to the filesystems horse and walrus from the login nodes. Both login nodes are part of the cluster; failures, reboots, and other work might affect your data transfer, resulting in data corruption.

Software and Modules

The most straightforward method for using the software is the well-known module system. All software available from the module system has been specifically built for the cluster Capella, i.e., with optimization for the Zen4 (Genoa) microarchitecture and CUDA support enabled.
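A minimal sketch of finding and loading software via the module system (the package name is just an example; check module spider for what is actually installed on Capella):

    # Search for available versions of a package, e.g., PyTorch
    module spider PyTorch
    # Load the desired module (exact name/version may differ) and verify
    module load PyTorch
    module list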

Python Virtual Environments

Virtual environments allow you to install additional Python packages and create an isolated runtime environment. We recommend using venv for this purpose.

Virtual environments in workspaces

We recommend using workspaces for your virtual environments.
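A minimal sketch of creating a venv inside an existing workspace on cat (the workspace path, module name, and package are assumptions; adjust them to your setup):

    # Load a Python module first so the venv is built against it
    module load Python
    # Create and activate the virtual environment inside your workspace
    python -m venv /data/cat/ws/marie-python-env/env
    source /data/cat/ws/marie-python-env/env/bin/activate
    # Install the packages you need (torch is just an example)
    pip install --upgrade pip
    pip install torch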

Batchsystem

The batch system Slurm may be used as usual. Please refer to the page Batch System Slurm for detailed information. In addition, the page Job Examples provides examples of GPU allocation with Slurm.
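A hedged sketch of a GPU job script for Capella (resource values and the module name are placeholders; see the Job Examples page for authoritative settings):

    #!/bin/bash
    #SBATCH --job-name=gpu-test
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=14      # example value; at most 56 cores per node can be requested
    #SBATCH --gres=gpu:1            # request one GPU
    #SBATCH --time=02:00:00         # keep jobs short during the acceptance phase
    #SBATCH --mem=64G               # example value

    module load PyTorch             # example module; adjust to your software
    srun python train.py

Submit it as usual with sbatch jobfile.sh.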

You can find out about upcoming reservations (e.g., for acceptance benchmarks) via sinfo -T. Acceptance has priority, so your reservation requests cannot currently be considered.

Slurm limits and job runtime

Although each compute node is equipped with 64 CPU cores in total, a maximum of only 56 can be requested via Slurm (cf. Slurm Resource Limits Table).

The maximum runtime of jobs and interactive sessions is currently 24 hours. However, to allow for greater fluctuation in testing, please make your jobs shorter if possible. You can use Chain Jobs to split a long-running job exceeding the batch queue's limits into parts and chain these parts, as sketched below. Applications with built-in checkpoint-restart functionality are very well suited to this approach! If your application provides checkpoint-restart, please use /data/cat for temporary data, and remove these data afterwards!
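A minimal sketch of chaining two job parts with a Slurm dependency (the script names are placeholders; --dependency=afterok is standard Slurm):

    # Submit the first part and capture its job ID
    JOBID=$(sbatch --parsable part1.sh)
    # The second part starts only after the first one finished successfully
    sbatch --dependency=afterok:${JOBID} part2.sh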

The partition capella-interactive can be used for your small tests and for compiling software. You need to add #SBATCH --partition=capella-interactive to your job file or --partition=capella-interactive to your sbatch, srun, and salloc command lines, respectively, to address this partition. The partition's configuration might be adapted during the acceptance phase. You can retrieve the current settings via scontrol show partitions capella-interactive.
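A hedged example of an interactive session on this partition (the resource values are placeholders):

    # Interactive shell with one GPU on the capella-interactive partition
    srun --partition=capella-interactive --gres=gpu:1 --cpus-per-task=8 --mem=16G --time=01:00:00 --pty bash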