Changes

CCU:New GPU Cluster

54 bytes added, 4 years ago

== Overview ==

In January, the old GPU cluster will gradually be dismantled and integrated into a new Kubernetes cluster. The reason is a set of major hardware upgrades to the backbone infrastructure:

* New Ceph-based storage cluster with currently 210 TB of NVMe storage to supply all compute nodes with data.

* New network backbone: HDR InfiniBand (200 Gb/s).

* Triple-redundant servers to supply basic services and serve API requests, which should minimize downtime.

* As a cherry on top, another GPU server with 4x A100.

Since we are reinstalling everything from scratch, usage of the cluster will also change slightly, both for easier access to storage (getting rid of the somewhat cumbersome need to allocate persistent volumes) and for improved security (separate user namespaces).

We first list the changes in how to use the cluster, then give a detailed guide to moving over your data and pods.

== Pod configuration on the new cluster ==

Bastian.goldluecke

