In January, the old GPU cluster will gradually be dismantled and integrated into a new Kubernetes cluster. The reason is a set of major hardware upgrades to the backbone infrastructure:
* New Ceph-based storage cluster with currently 210 TB of NVMe storage to supply all compute nodes with data.
* New network backbone: HDR InfiniBand (200 Gbit/s).
* Triple-redundant servers that provide basic services and serve API requests, to minimize downtime.
* As a cherry on top, another GPU server with 4x A100.
Because we are reinstalling everything from scratch, usage of the cluster will also change slightly. The changes make storage easier to access (removing the somewhat cumbersome need to allocate persistent volumes) and improve security (separate user namespaces).
We first provide a comprehensive list of changes in how to use the cluster, then give a detailed manual for how to move over your data and pods.
== Pod configuration on the new cluster ==