Cluster:Compute nodes

Targeting a specific node

Targeting a specific node can be done in two different ways: either by selecting a node name directly, or by requiring certain labels on the node. See the table below for node names and their associated labels.


Selecting a node name

Example: GPU-enabled pod which runs only on the node "belial":

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  nodeSelector:
    kubernetes.io/hostname: belial
  containers:
  - name: gpu-container
    image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
    command: ["sleep", "1d"]
    resources:
      requests:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi
      limits:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi
  # more specs (volumes etc.)
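Assuming the manifest is saved as gpu-pod.yaml, the pod can be created with "kubectl apply -f gpu-pod.yaml" and inspected with "kubectl get pod gpu-pod".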


Requiring a certain label on the node

Example: GPU-enabled pod which requires compute capability of at least sm-60:
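According to the table below, nvidia-compute-capability-sm75 and nvidia-compute-capability-sm80 are the only compute-capability labels in the cluster, and both satisfy "at least sm-60". A minimal sketch of such a pod, assuming these labels are plain label keys that can be matched with an Exists expression (their values are not documented here); note that all nodes carrying these labels except Asmodeus are also tainted, so a toleration from the next section may be needed in addition:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-sm60
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:   # terms are ORed: a node with sm75 OR sm80 matches
        - matchExpressions:
          - key: nvidia-compute-capability-sm75
            operator: Exists
        - matchExpressions:
          - key: nvidia-compute-capability-sm80
            operator: Exists
  containers:
  - name: gpu-container
    image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
    command: ["sleep", "1d"]
    resources:
      requests:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi
      limits:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi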

Acquiring GPUs with more than 20 GB

By default, Kubernetes schedules GPU pods only on the smallest class of GPU with 20 GB of memory. This is achieved by assigning a "node taint" to nodes with higher-grade GPUs, which makes such a node available only to pods that declare a "toleration" for the taint.
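In the pod spec, a toleration looks like this; a minimal sketch, assuming the taints are applied with the NoSchedule effect (the effect is not documented in the table below):

tolerations:
- key: gpumem-32
  operator: Exists    # matches the taint regardless of its value
  effect: NoSchedule  # assumed effect; omit this field to tolerate any effect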

So if your task, for example, requires a GPU with *exactly* 32 GB, you have to

  1. make the pod tolerate the taint "gpumem-32" (see table below).
  2. make the pod require the node label "gpumem-32".


Example:
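A minimal sketch of such a pod: the gpumem-32 label is matched with an Exists expression, since its value is not documented here, and the NoSchedule taint effect is again an assumption:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-32gb
spec:
  tolerations:
  - key: gpumem-32      # tolerate the gpumem-32 taint
    operator: Exists
    effect: NoSchedule  # assumed effect
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpumem-32   # require a node labelled gpumem-32
            operator: Exists
  containers:
  - name: gpu-container
    image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
    command: ["sleep", "1d"]
    resources:
      requests:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi
      limits:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi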


If you need a GPU with *at least* 32 GB, but would also be happy with 40 GB, you have to

  1. make the pod tolerate the taint "gpumem-32" *and* "gpumem-40".
  2. make the pod require the node label "gpumem-32" *or* "gpumem-40".


Example:
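The same sketch extended under the same assumptions: both taints are tolerated, and the two nodeSelectorTerms are ORed by Kubernetes, which expresses "gpumem-32 or gpumem-40":

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-min-32gb
spec:
  tolerations:
  - key: gpumem-32
    operator: Exists
    effect: NoSchedule  # assumed effect
  - key: gpumem-40
    operator: Exists
    effect: NoSchedule  # assumed effect
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:   # terms are ORed: gpumem-32 OR gpumem-40
        - matchExpressions:
          - key: gpumem-32
            operator: Exists
        - matchExpressions:
          - key: gpumem-40
            operator: Exists
  containers:
  - name: gpu-container
    image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
    command: ["sleep", "1d"]
    resources:
      requests:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi
      limits:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi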

List of compute nodes

The following nodes are currently part of the cluster. Note that the master node is CPU-only and is not used for computations, as it hosts all CCU infrastructure (among a few other things).

CCU name   | Access      | Platform        | GPUs                                               | Labels                                                 | Taints
Vecna      | exc-cb, inf | nVidia DGX-2    | 16 x V100 @ 32 GB                                  | gpumem-32, nvidia-v100, nvidia-compute-capability-sm80 | gpumem-32
Glasya     | trr161      | Dual Xeon Rack  | 4 x Titan RTX @ 24 GB                              | gpumem-24, nvidia-rtx, nvidia-compute-capability-sm80  | gpumem-24
Belial     | exc-cb      | Supermicro      | 8 x Quadro RTX 6000 @ 24 GB                        | gpumem-24, nvidia-rtx, nvidia-compute-capability-sm75  | gpumem-24
Fierna     | exc-cb      | Supermicro      | 8 x Quadro RTX 6000 @ 24 GB                        | gpumem-24, nvidia-rtx, nvidia-compute-capability-sm75  | gpumem-24
Zariel     | trr161      | nVidia DGX A100 | 8 x A100 @ 40 GB                                   | gpumem-40, nvidia-a100, nvidia-compute-capability-sm80 | gpumem-40
Tiamat     | exc-cb      | Supermicro      | 4 x A100 @ 40 GB                                   | gpumem-40, nvidia-a100, nvidia-compute-capability-sm80 | gpumem-40
Asmodeus   | all         | Supermicro      | 4 x A100 HGX 320 GB, subdivided into 16 GPUs @ 20 GB | gpumem-20, nvidia-a100, nvidia-compute-capability-sm80 | (none)
Demogorgon | exc-cb      | Delta           | 8 x A40 @ 40 GB                                    | gpumem-40, nvidia-a40, nvidia-compute-capability-sm80  | gpumem-40


The CCU name is the internal name used in the Kubernetes cluster, as well as the configured hostname of the node. Nodes are not accessible from the outside world; you have to access the cluster via kubectl through the API server.
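With a working kubeconfig, "kubectl get nodes --show-labels" lists the nodes from the table above together with their labels, which is also a convenient way to double-check the label names used in the examples.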

In the "Access" column you can find which Kubernetes user groups can access each node.

Group  | Description
exc-cb | Centre for the Advanced Study of Collective Behaviour
trr161 | SFB Transregio 161 "Quantitative Methods for Visual Computing"
inf    | Department of Computer Science
cvia   | Computer Vision and Image Analysis Group