Difference between revisions of "CCU:Cluster Updates 2020 10"

From Collective Computational Unit
m (Preview: changes behind the scenes, new nodes)
Line 22:
  == Preview: changes behind the scenes, new nodes ==
- * New storage cluster (Ceph, 3 monitor and 3 OSD nodes, 230 TB NVMe in addition to what we have)
+ * New storage cluster (Ceph, 3 monitor and 3 OSD nodes, 230 TB NVMe in addition to what we already have)
  * New GPU server: nVidia DGX A100 (with 8x A100, 40 GB/GPU, NVLink)
  * New GPU server: Supermicro with 4x A100, 40 GB/GPU, NVLink
- * New backbone storage network: EDR Infiniband (100 GB/s)
+ * New backbone storage network: HDR Infiniband (200 GB/s)

Revision as of 10:10, 25 September 2020

Early warning

The Kubernetes cluster will undergo a major hardware update in October.

TL;DR: a complete cluster reinstallation will be necessary due to major changes in the underlying network hardware. New persistent storage will be installed, and all persistent volumes will need to be deleted, as the drives will be integrated into the new system. Please start backing up everything now, and be prepared to delete all your pods and PVs on short notice.
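As a rough illustration of the backup-then-delete workflow described above, the following Python sketch assembles the `kubectl` commands for one volume. It only builds and prints the commands; it does not run them. The namespace, pod, claim name, and paths are hypothetical placeholders, and the sketch assumes that your data sits on a PersistentVolumeClaim mounted into a running pod and that the container image provides `tar`, which `kubectl cp` needs.

```python
import shlex


def backup_and_cleanup_commands(namespace, pod, mount_path, local_dir, pvc):
    """Return kubectl invocations to back up one volume, then remove pod and claim.

    All arguments are placeholders for illustration; substitute your own names.
    """
    return [
        # List the claims in the namespace to see what needs backing up.
        ["kubectl", "get", "pvc", "-n", namespace],
        # Copy the mounted volume contents from the pod to a local directory.
        ["kubectl", "cp", f"{namespace}/{pod}:{mount_path}", local_dir],
        # Only after verifying the backup: delete the pod, then its claim.
        ["kubectl", "delete", "pod", pod, "-n", namespace],
        ["kubectl", "delete", "pvc", pvc, "-n", namespace],
    ]


if __name__ == "__main__":
    # Hypothetical names; print the commands instead of executing them.
    for cmd in backup_and_cleanup_commands(
        "my-namespace", "my-pod", "/data", "./backup-data", "my-pvc"
    ):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them before anything is deleted. Check the local copy before running the two delete commands, since the data will not be recoverable after the reinstallation.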

Some details of using the cluster will change slightly after the reinstallation. You can already test some of these changes; please do so and help me find possible bugs before everything goes live. See below for more information about the major changes.

Preview: changes in persistent volumes

Preview: changes in namespace scopes

Preview: changes in permissions and resource use

Preview: changes behind the scenes, new nodes

  • New storage cluster (Ceph, 3 monitor and 3 OSD nodes, 230 TB NVMe in addition to what we already have)
  • New GPU server: NVIDIA DGX A100 (with 8x A100, 40 GB/GPU, NVLink)
  • New GPU server: Supermicro with 4x A100, 40 GB/GPU, NVLink
  • New backbone storage network: HDR InfiniBand (200 Gbit/s)
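
InfiniBand link rates are specified in gigabits per second: EDR is nominally 100 Gbit/s and HDR is nominally 200 Gbit/s. To give a feeling for what that means for storage traffic, the sketch below converts the nominal link rate into gigabytes per second and estimates an idealized transfer time. It assumes decimal units (1 TB = 1000 GB) and ignores protocol overhead, Ceph replication, and drive limits, so real throughput will be lower.

```python
def gbit_to_gbyte_per_s(gbit_per_s):
    """Convert a link rate from gigabits per second to gigabytes per second."""
    return gbit_per_s / 8


def ideal_transfer_seconds(size_tb, gbit_per_s):
    """Best-case time to move size_tb terabytes over a single link."""
    size_gb = size_tb * 1000  # decimal terabytes to gigabytes
    return size_gb / gbit_to_gbyte_per_s(gbit_per_s)


if __name__ == "__main__":
    print(gbit_to_gbyte_per_s(200))        # HDR: 25.0 GB/s
    print(ideal_transfer_seconds(1, 200))  # 1 TB over HDR: 40.0 s
    print(ideal_transfer_seconds(1, 100))  # 1 TB over EDR: 80.0 s
```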