Difference between revisions of "Cluster:Compute nodes"
Revision as of 18:29, 27 November 2021
Targeting a specific node
Targeting a specific node can be done in two different ways, either selecting a node name directly, or requiring certain labels on the node. See table below for node names and associated labels.
Selecting a node name
Example: GPU-enabled pod which runs only on the node "belial":
<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  nodeSelector:
    kubernetes.io/hostname: belial
  containers:
  - name: gpu-container
    image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
    command: ["sleep", "1d"]
    resources:
      requests:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi
      limits:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi
  # more specs (volumes etc.)
</syntaxhighlight>
Requiring a certain label on the node
Example: GPU-enabled pod which requires compute capability of at least sm-75:
<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  nodeSelector:
    compute-capability-atleast-sm75: "true"
    # note: if a node has e.g. the label "compute-capability-sm80", it also has the
    # corresponding "atleast" label for all lower compute capabilities. The same holds for "gpumem".
  containers:
  - name: gpu-container
    image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
    command: ["sleep", "1d"]
    resources:
      requests:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi
      limits:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi
  # more specs (volumes etc.)
</syntaxhighlight>
Acquiring GPUs with more than 20 GB
By default, Kubernetes schedules GPU pods only on the smallest class of GPU, with 20 GB of memory. This is achieved by assigning a "node taint" to nodes with higher-grade GPUs, which makes a node available only to pods that declare they "tolerate" the taint.
So if your task, for example, requires a GPU with *exactly* 32 GB, you have to
- make the pod tolerate the taint "gpumem-32" (see table below), and
- make the pod require the node label "gpumem-32".
Example:
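A minimal sketch of such a pod spec (not from the original page; it assumes the label value is "true", matching the label style in the example above, and uses "operator: Exists" so the toleration matches the taint regardless of its effect):

```yaml
# Sketch only: pin the pod to nodes with exactly 32 GB GPUs.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-32gb
spec:
  nodeSelector:
    gpumem-32: "true"      # require the node label "gpumem-32" (value assumed)
  tolerations:
  - key: gpumem-32         # tolerate the taint "gpumem-32", whatever its effect
    operator: Exists
  containers:
  - name: gpu-container
    image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
    command: ["sleep", "1d"]
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
```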
If you need a GPU with *at least* 32 GB, but would also be happy with 40 GB, you have to
- make the pod tolerate the taints "gpumem-32" *and* "gpumem-40", and
- make the pod require the node label "gpumem-32" *or* "gpumem-40".
Example:
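A sketch for this case (again an assumption-laden illustration, not from the original page): a plain nodeSelector cannot express "label A *or* label B", so node affinity with two nodeSelectorTerms (which Kubernetes ORs together) is used instead:

```yaml
# Sketch only: accept nodes carrying either gpumem-32 or gpumem-40.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-min32gb
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:      # multiple terms are ORed by the scheduler
        - matchExpressions:
          - key: gpumem-32
            operator: Exists
        - matchExpressions:
          - key: gpumem-40
            operator: Exists
  tolerations:                  # tolerate both taints, so either node class is usable
  - key: gpumem-32
    operator: Exists
  - key: gpumem-40
    operator: Exists
  containers:
  - name: gpu-container
    image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
    command: ["sleep", "1d"]
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
```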
List of compute nodes
The following nodes are currently part of the cluster. Note that the master node is CPU only and not used for computations, as it hosts all CCU infrastructure (among a few other things).
| CCU name | Access | Platform | GPUs | Labels | Taints |
|---|---|---|---|---|---|
| Vecna | exc-cb, inf | nVidia DGX-2 | 16 x V100 @ 32 GB | gpumem-32, nvidia-v100, nvidia-compute-capability-sm80 | gpumem-32 |
| Glasya | trr161 | Dual Xeon Rack | 4 x Titan RTX @ 24 GB | gpumem-24, nvidia-rtx, nvidia-compute-capability-sm80 | gpumem-24 |
| Belial | exc-cb | Supermicro | 8 x Quadro RTX 6000 @ 24 GB | gpumem-24, nvidia-rtx, nvidia-compute-capability-sm75 | gpumem-24 |
| Fierna | exc-cb | Supermicro | 8 x Quadro RTX 6000 @ 24 GB | gpumem-24, nvidia-rtx, nvidia-compute-capability-sm75 | gpumem-24 |
| Zariel | trr161 | nVidia DGX A100 | 8 x A100 @ 40 GB | gpumem-40, nvidia-a100, nvidia-compute-capability-sm80 | gpumem-40 |
| Tiamat | exc-cb | Supermicro | 4 x A100 @ 40 GB | gpumem-40, nvidia-a100, nvidia-compute-capability-sm80 | gpumem-40 |
| Asmodeus | all | Supermicro | 4 x A100 HGX 320 GB, subdivided in 16 GPUs @ 20 GB | gpumem-20, nvidia-a100, nvidia-compute-capability-sm80 | |
| Demogorgon | exc-cb | Delta | 8 x A40 @ 40 GB | gpumem-40, nvidia-a40, nvidia-compute-capability-sm80 | gpumem-40 |
The CCU name is the internal name used in the Kubernetes cluster, as well as the configured hostname of the node. Nodes are not accessible from the outside world; you have to access the cluster with kubectl through the API server.
The "Access" column lists the Kubernetes user groups that can access each node.
| Group | Description |
|---|---|
| exc-cb | Centre for the Advanced Study of Collective Behaviour |
| trr161 | SFB Transregio 161 "Quantitative Methods for Visual Computing" |
| inf | Department of Computer Science |
| cvia | Computer Vision and Image Analysis Group |
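The labels and taints listed in the node table can be checked directly against the cluster with standard kubectl commands (a quick reference; node names follow the table above):

```shell
# List all nodes together with their labels
kubectl get nodes --show-labels

# Show details (including taints) for a specific node, e.g. belial
kubectl describe node belial
```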