CCU:Cluster Updates 2020 10

From Collective Computational Unit
Revision as of 10:16, 25 September 2020 by Bastian.goldluecke (talk | contribs) (Preview: changes in persistent volumes)

Early warning

The Kubernetes cluster will undergo a major hardware update in October.

TL;DR: a complete cluster reinstallation will be necessary due to major changes in the underlying network hardware. New persistent storage will be installed, and all persistent volumes will need to be deleted, as the drives will be integrated into the new system. Please start backing up everything and be prepared to delete all your pods and PVs on short notice.
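As a sketch of what such a backup can look like with standard kubectl commands: copy the volume contents out of a running pod, then delete the pod and the claim. The namespace, pod, claim, and path names below are placeholders for your own setup, not names defined by the cluster.

```shell
# Copy the contents of a volume mounted in a running pod to your machine
# (replace my-namespace, my-pod, my-claim and /data with your own names).
kubectl -n my-namespace cp my-pod:/data ./pv-backup

# Once everything is saved, delete the pod and its persistent volume claim.
kubectl -n my-namespace delete pod my-pod
kubectl -n my-namespace delete pvc my-claim

# The bound PV is then released or removed, depending on its reclaim policy.
kubectl get pv
```

These commands require access to a live cluster, so run them from a machine where your kubeconfig is set up.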

Some details of how the cluster is used will change slightly after the reinstallation. Some of the changes can already be tested; please do so and help me find possible bugs before everything goes live. See below for more information about the major changes.

Preview: changes in persistent volumes

All local NVMe and SSD drives will be integrated into the Ceph storage cluster. The nodes will no longer provide local PVs, and there will be only one cluster-wide global storage class. Special read-only storage for shared datasets will be provided as before, but backed by Ceph instead of an NFS export.
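After the reinstall, a claim would then simply reference the single global class. This is only a sketch: the class name `global` below is a placeholder, and the actual name will be announced with the reinstall.

```yaml
# Sketch of a PVC against the single cluster-wide storage class.
# "global" is a placeholder; the real class name is not yet fixed.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: global
  resources:
    requests:
      storage: 50Gi
```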

What you can do right now to get rid of most of the data on your PVs is to move all static datasets to the system-wide storage, as described here. This framework will persist across the cluster reinstall, and no data will be lost.
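From inside a pod that mounts both your PV and the shared dataset storage, the move is a plain copy. All paths below are placeholders; see the linked description of the dataset framework for the real mount points.

```shell
# Move a static dataset from a PV to the system-wide shared storage
# (paths are placeholders for your own mounts).
rsync -a --progress /pv/datasets/my-dataset/ /shared-datasets/my-dataset/

# Check that the copy is complete, then free the space on the PV.
du -sh /shared-datasets/my-dataset
rm -r /pv/datasets/my-dataset
```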

Preview: changes in namespace scopes

Preview: changes in permissions and resource use

Preview: changes behind the scenes, new nodes

  • New storage cluster (Ceph, 3 monitor and 3 OSD nodes, 230 TB NVMe in addition to what we already have)
  • New GPU server: NVIDIA DGX A100 (with 8x A100, 40 GB/GPU, NVLink)
  • New GPU server: Supermicro with 4x A100, 40 GB/GPU, NVLink
  • New backbone storage network: HDR InfiniBand (200 Gbit/s)
  • New dedicated Ethernet network for Kubernetes