Cluster:Compute nodes
== List of compute nodes ==
'''NOTE: Imp and Dretch do not have an InfiniBand connection, so Ceph filesystem access is slightly slower. Using the local RAID for caching data is recommended. Both machines (Imp in particular) have much less powerful GPUs than the rest of the cluster, so these two systems are ideal for testing and experimenting.'''
 
The following GPU nodes are currently part of the cluster. There are more nodes which act as API servers or provide the Ceph filesystem and web services, but these are not available for standard users.
 
Note: Labels / Taints in this table might be outdated; use "kubectl describe node <name>" for up-to-date information.
{| class="wikitable"
! scope="col"| CCU name
! scope="col"| Access
! scope="col"| Hardware
! scope="col"| GPUs
! scope="col"| Labels
! scope="col"| Taints
|-
! scope="row"| imp
| all
| Dual Xeon Rack
| 4 x Titan Xp @ 12 GB
| gpumem=12, gpuarch=nvidia-titan, nvidia-compute-capability-sm70=true
| no-infiniband=true:NoSchedule
|-
! scope="row"| dretch
| all
| Dual Xeon Rack
| 4 x Titan RTX @ 24 GB
| gpumem=24, gpuarch=nvidia-titan, nvidia-compute-capability-sm70=true
| no-infiniband=true:NoSchedule
|-
! scope="row"| belial
| exc-cb
| Supermicro
|
|
| gpumem=24:NoSchedule
|-
! scope="row"| fierna
| exc-cb
| Supermicro
|
|
| gpumem=24:NoSchedule
|-
! scope="row"| vecna
| exc-cb, inf
| nVidia DGX-2
| 16 x V100 @ 32 GB
|
| gpumem=32:NoSchedule
|-
! scope="row"| zariel
| trr161
| nVidia DGX A100
| 8 x A100 @ 40 GB
|
| gpumem=40:NoSchedule
|-
! scope="row"| tiamat
| exc-cb
| Supermicro
|
|
| gpumem=40:NoSchedule
|-
! scope="row"| asmodeus
| all
| Supermicro
|
|
| gpumem=40:NoSchedule
|-
! scope="row"| demogorgon
| exc-cb
| Delta
| A40 @ 48 GB
| gpumem=48, gpuarch=nvidia-a40, nvidia-compute-capability-sm80=true
| gpumem=48:NoSchedule
|-
! scope="row"| kiaransalee
| seds
| Delta
| 8 x H100 HGX 640 GB
| gpumem=80, gpuarch=nvidia-h100, nvidia-compute-capability-sm80=true
| gpumem=80:NoSchedule
|}
The CCU name is the internal name used in the Kubernetes cluster, as well as the configured hostname of the node. Nodes are not accessible from the outside world; you have to access the cluster with kubectl through the API server.
The column "Access" lists which Kubernetes user groups are allowed to access each node. Please only target a specific node if you are allowed to.
{| class="wikitable"
! scope="row"| inf
| Department of Computer Science
|-
! scope="row"| seds
| Social and Economic Data Sciences
|-
! scope="row"| cvia
|
|}

=== Selecting a node name ===
Example: a GPU-enabled pod which runs only on the node "belial". Note that Belial is a powerful system, so it is protected by a taint (see table above). Thus, you also have to tolerate the respective taint so that the pod can actually be scheduled on Belial, which is explained below. A minimal sketch (pod name and container image are placeholders, adapt them to your workload):
<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: belial-example   # placeholder name
spec:
  nodeSelector:
    kubernetes.io/hostname: belial   # pin the pod to this node
  tolerations:
  - key: "gpumem"                    # tolerate belial's taint (see table)
    operator: "Equal"
    value: "24"
    effect: "NoSchedule"
  containers:
  - name: main
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # example image, use your own
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
</syntaxhighlight>
== Targeting more powerful GPUs ==
By default, Kubernetes schedules GPU pods only on the smallest class of GPU (nVidia Titan). This is achieved by assigning nodes with higher-grade GPUs a "node taint", which makes the node available only to pods which declare that they are "tolerant" of the taint.
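For illustration, the taint mechanism is visible directly on the node objects. For a 32 GB node (values from the table above), the relevant excerpt looks roughly like this:
<syntaxhighlight lang="yaml">
# sketch: excerpt of "kubectl get node vecna -o yaml"
spec:
  taints:
  - key: gpumem
    value: "32"
    effect: NoSchedule
</syntaxhighlight>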
So if, for example, your task requires a GPU with *exactly* 32 GB, you have to make the pod select the node label gpumem=32 and tolerate the corresponding taint:
<syntaxhighlight lang="yaml">
spec:
  nodeSelector:
    gpumem: "32"
  tolerations:
  - key: "gpumem"
    operator: "Equal"
    value: "32"
    effect: "NoSchedule"
</syntaxhighlight>
If you need a GPU with *at least* 32 GB, but would also be happy with more, you can just tolerate any amount and make the pod require the node label "gpumem" to be larger than 31.
Note: typically, you should *not* do this and reserve a GPU which has just enough memory. However, if e.g. all 32 GB GPUs are busy already, you can move up to a 40 GB GPU.
Example:
<syntaxhighlight lang="yaml">
tolerations:
- key: "gpumem"
  operator: "Exists"   # tolerates the gpumem taint regardless of its value
  effect: "NoSchedule"
# ... rest of the spec like before
</syntaxhighlight>
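A nodeSelector can only match exact label values, so requiring the "gpumem" label to be larger than 31 needs node affinity instead, which supports the Gt operator. A sketch (field names follow the standard Kubernetes pod spec):
<syntaxhighlight lang="yaml">
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpumem
            operator: Gt          # values are compared as integers
            values: ["31"]        # matches nodes with gpumem > 31
</syntaxhighlight>
Combined with a toleration for the gpumem taint, this schedules the pod on any node with more than 31 GB of GPU memory.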