CCU:Cluster Updates 2020 10

From Collective Computational Unit
Revision as of 20:14, 24 September 2020 by Bastian.goldluecke (talk | contribs) (Preview: changes behind the scenes, new nodes)


Early warning

The Kubernetes cluster will undergo a major hardware update in October.

TL;DR: a complete cluster reinstallation will be necessary due to major changes in the underlying network hardware. New persistent storage will be installed, and all persistent volumes will need to be deleted, as the drives will be integrated into the new system. Please start backing up everything now, and be prepared to delete all your pods and PVs on short notice.
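As a rough illustration of the backup-then-delete order described above, the following Python sketch only builds the `kubectl` command lines and prints them, so nothing is executed against the cluster. It assumes that your volumes are bound through PersistentVolumeClaims and that `kubectl cp` works for your containers (it needs `tar` inside the container). The namespace, pod name, mount path, claim name and the `build_cleanup_plan` helper are hypothetical placeholders, not names used on this cluster.

```python
import shlex


def build_cleanup_plan(namespace, pod_mounts, pvc_names, backup_dir="backup"):
    """Return kubectl commands (as argument lists): backups first, deletions last.

    pod_mounts maps a pod name to the in-container path where its volume
    is mounted. All names passed in are supplied by the caller.
    """
    plan = []
    # 1. Copy data out of every pod while it is still running.
    for pod, mount_path in sorted(pod_mounts.items()):
        source = f"{namespace}/{pod}:{mount_path}"
        plan.append(["kubectl", "cp", source, f"{backup_dir}/{pod}"])
    # 2. Delete the pods so that no volume is in use any more.
    for pod in sorted(pod_mounts):
        plan.append(["kubectl", "delete", "pod", pod, "-n", namespace])
    # 3. Delete the claims; this releases the underlying persistent volumes.
    for pvc in sorted(pvc_names):
        plan.append(["kubectl", "delete", "pvc", pvc, "-n", namespace])
    return plan


if __name__ == "__main__":
    # Hypothetical example values; replace them with your own.
    example = build_cleanup_plan(
        "my-namespace", {"trainer-0": "/data"}, ["trainer-data"]
    )
    for command in example:
        print(shlex.join(command))
```

Review the printed commands and check that the backup copies are complete before running any of the delete commands by hand.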

Some details of using the cluster will change slightly after the reinstallation. Some of these changes can already be tested; please do so and help me find possible bugs before all changes go live. See below for more information about the major changes.

Preview: changes in persistent volumes

Preview: changes in namespace scopes

Preview: changes in permissions and resource use

Preview: changes behind the scenes, new nodes

  • New storage cluster (Ceph, 3 monitor and 3 OSD nodes, 230 TB NVMe in addition to what we have)
  • New GPU server: NVIDIA DGX A100 (with 8x A100, 40 GB/GPU, NVLink)
  • New GPU server: Supermicro with 4x A100, 40 GB/GPU, NVLink
  • New backbone storage network: EDR InfiniBand (100 Gbit/s)