CCU:New GPU Cluster

From Collective Computational Unit
Latest revision as of 15:25, 30 January 2021


Overview

In January, the old GPU cluster will gradually be dismantled and integrated into a new Kubernetes cluster. The reason is a massive hardware upgrade of the backbone infrastructure:

  • New Ceph-based storage cluster with currently 210 TB of NVMe storage to supply all compute nodes with data.
  • New network backbone: HDR InfiniBand (200 Gb/s).
  • Triple-redundant servers to supply basic services and serve API requests, so that downtime should be minimized.
  • As a cherry on top, another GPU server with 4x A100.

Since we are reinstalling everything from scratch, the usage of the cluster will also change slightly, both for easier access to storage (getting rid of the somewhat cumbersome need to allocate persistent volumes) and improved security (separate user namespaces).

We first provide a comprehensive list of changes in how to use the cluster, then give a detailed manual for how to move over your data and pods.

Pod configuration on the new cluster

User namespace, pod security and quotas

Each user works in their own namespace now, which is auto-generated when your login is created. The naming convention is as follows:

  • Login ID : firstname.lastname
  • Username : firstname-lastname
  • Namespace: user-firstname-lastname

That means you replace every '.' in your login ID with a '-' to obtain the username, and prepend "user-" to obtain the namespace.
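The mapping can be sketched in shell (the login name below is purely illustrative):

```shell
# Derive username and namespace from a login ID; "jane.doe" is an example.
login="jane.doe"
username=$(echo "$login" | tr '.' '-')   # replace every '.' with '-'
namespace="user-$username"               # prepend "user-"
echo "$username $namespace"              # jane-doe user-jane-doe
```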

Thus, you should set your default namespace in the kubeconfig accordingly, and perhaps update pod configurations. Note that the security policy for pods is a bit more restrictive than before, to catch problematic cases. Please inform me if security policies disrupt your usual workflow so that we can work something out. Also, whenever you feel you should be able to do something but are not allowed to, please ask whether this is intended.

Finally, there is now a mechanism in place to set resource quotas for individual users. The preset is quite generous at the moment since we have plenty of resources, but if you believe your account is too limited, please contact us.
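For reference, per-user quotas in Kubernetes are typically expressed as a ResourceQuota object in the user's namespace; you can inspect your own with "kubectl describe quota". The numbers below are invented for illustration and are not the actual preset:

```yaml
# Illustrative only - the real preset values are managed by the admins.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: user-quota
  namespace: user-firstname-lastname
spec:
  hard:
    requests.cpu: "32"
    requests.memory: 128Gi
    requests.nvidia.com/gpu: "4"
    pods: "20"
```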

Persistent volume management (or lack thereof)

The ceph storage cluster provides a file system which is mounted on every node in the cluster. Pods are allowed to mount a subset of the filesystem as a host path, see the example pod below. The following directories can be mounted:

  • /cephfs/abyss/home/<username>: this is your personal home directory which you can use any way you like.
  • /cephfs/abyss/shared: a shared directory where every user has read/write access everywhere, so your data is not secure here - the intention is to have a quick and dirty method to share results between users. To avoid total anarchy in this filesystem, please use sensible names and organize files into subdirectories. For example, put personal files which you want to make accessible to everyone in "/abyss/shared/users/<username>". I will monitor how it works out and whether we need more rules here. If you need more private group-based storage to share with a small subset of trusted users, please contact me.
  • /cephfs/abyss/datasets: directory for static datasets, mounted read-only. These are large general-interest datasets for which we only want to store one copy on the filesystem (no separate ImageNets for everyone, please). So whenever you have a well-known public dataset in your shared directory which you think is useful to have in the static tree, please contact me and I will move it to the read-only region.
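The suggested layout of the shared directory can be sketched as follows; a temp directory stands in for the real /abyss/shared mount so the commands run anywhere, and "jane-doe" is a placeholder username:

```shell
# Sketch of the suggested /abyss/shared layout; a temp dir stands in for
# the real mount and "jane-doe" is a placeholder username.
SHARED=$(mktemp -d)
mkdir -p "$SHARED/users/jane-doe/experiment-results"
echo "toy result" > "$SHARED/users/jane-doe/experiment-results/metrics.txt"
ls "$SHARED/users"   # jane-doe
```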

Copy data from the old cluster into the new filesystem

The shared file system can be mounted as an NFS volume on the old cluster, so you can create a pod which mounts both the new filesystem and your PVs from the old cluster. Please use the following pod configuration as a template and add additional mounts for the PVs you want to copy over. Note that if you want to copy over local PVs which live on different nodes, you have to create two different pods, as otherwise the mounts conflict and the pod will be pending forever.

apiVersion: v1
kind: Pod
metadata:
  name: <your-username>-transfer-pod
  namespace: exc-cb
spec:
  # vecna is a good node as it has the fastest connection to the new file system.
  # however, if you have to copy local PVs, then the pod needs to be on the respective node.
  # nodeSelector:
  #   kubernetes.io/hostname: vecna
  containers:
  - name: ubuntu
    image: ubuntu:20.04
    command: ["sleep", "1d"]
    volumeMounts:
      - mountPath: /abyss/shared
        name: cephfs-shared
        readOnly: false
  volumes:
    - name: cephfs-shared
      nfs:
        path: /cephfs/abyss/shared
        server: ccu-node1

Afterwards, run a shell in the container and copy your stuff over to /abyss/shared/users/<your-username>. The following should do the trick. Note that this is not a secure directory as everyone has full read/write access, so copy over to your own home directory on the new cluster as soon as possible.

> kubectl exec -it <your-username>-transfer-pod -- /bin/bash
# cd /abyss/shared/users/<your-username>
# cp -r <all-my-stuff> ./
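The copy step can be exercised locally before running it in the pod; the sketch below uses temp directories in place of the old PV mount and your shared user directory, and verifies the copy with diff -r:

```shell
# Sketch of the copy-and-verify pattern. Temp dirs stand in for the old PV
# mount and /abyss/shared/users/<your-username> so this runs anywhere.
SRC=$(mktemp -d)   # stands in for an old PV mount
DST=$(mktemp -d)   # stands in for your shared user directory
echo "checkpoint data" > "$SRC/model.ckpt"
cp -r "$SRC"/. "$DST"/
diff -r "$SRC" "$DST" && echo "copy verified"
```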

Getting started on the new cluster

Login to the new cluster and update your kubeconfig

The frontend for the cluster and login services is located here:

https://ccu-k8s.inf.uni-konstanz.de/

Please choose "login to the cluster" and enter your credentials to obtain the kubeconfig data. Choose "full kubeconfig" on the left for all the details you need. Either back up your old kubeconfig and use this as a new one, or merge both into a new kubeconfig which allows you to easily switch context between the clusters. In the beginning, this might be useful, as you may have forgotten some data and still need to clean up the old cluster once everything works.

A kubeconfig for both clusters has the following structure (note this needs to be saved in "~/.kube/config"):

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRUdJTiBDRV ... <many more characters>
    server: https://134.34.224.84:6443
  name: ccu-old
- cluster:
    certificate-authority-data: LS0tLS1CRUdJTiBDRV ... <many more characters>
    server: https://ccu-k8s.inf.uni-konstanz.de:7443
  name: ccu-new
contexts:
- context:
    cluster: ccu-old
    namespace: exc-cb
    user: credentials-old
  name: ccu-old
- context:
    cluster: ccu-new
    namespace: <your-namespace>
    user: credentials-new
  name: ccu-new
current-context: ccu-new
kind: Config
preferences: {}
users:
- name: credentials-old
  <all the data below your username returned from the old loginapp goes here>
- name: credentials-new
  <all the data below your username returned from the new loginapp goes here>


Both the long CA data string and user credentials are returned from the respective loginapps of the clusters. Note: the CA data is different for both clusters, although the first couple of characters are the same.

If you have created such a kubeconfig for multiple contexts, you can easily switch between the clusters:

> kubectl config use-context ccu-old
> <... work with old cluster>
> kubectl config use-context ccu-new
> <... work with new cluster>

Defining different contexts is also a good way to switch between namespaces or users (which should not be necessary for the average user).

Running the first test container on the new cluster

After login and adjusting the kubeconfig to the new cluster and user namespace, you should be able to start your first pod. The following example pod mounts the ceph filesystems into an Ubuntu container image. Remember to fill in the placeholder <your-username> for your home directory below.

apiVersion: v1
kind: Pod
metadata:
  name: access-pod
spec:
  containers:
  - name: ubuntu
    image: ubuntu:20.04
    command: ["sleep", "1d"]
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 1
        memory: 1Gi
    volumeMounts:
      - mountPath: /abyss/home
        name: cephfs-home
        readOnly: false
      - mountPath: /abyss/shared
        name: cephfs-shared
        readOnly: false
      - mountPath: /abyss/datasets
        name: cephfs-datasets
        readOnly: true
  volumes:
    - name: cephfs-home
      hostPath:
        path: "/cephfs/abyss/home/<your-username>"
        type: Directory
    - name: cephfs-shared
      hostPath:
        path: "/cephfs/abyss/shared"
        type: Directory
    - name: cephfs-datasets
      hostPath:
        path: "/cephfs/abyss/datasets"
        type: Directory



Save this into e.g. "access-pod.yaml", start the pod, and verify that it has been created correctly and the filesystems have been mounted successfully, for example with the commands below. You can also check whether you can access the data you have copied over and copy/move it somewhere safe in your private home directory. If you have a large dataset which is probably useful for several people, please contact me so I can move it to the static read-only tree for datasets.

> kubectl apply -f access-pod.yaml
> kubectl get pods
> kubectl describe pod access-pod
> kubectl exec -it access-pod -- /bin/bash
$ ls /abyss/shared/<the directory you created for your data>

Moving your workloads to the new cluster

Next, verify that you can start a GPU-enabled pod. Try to create a pod with the following spec to allocate one GPU for you somewhere on the cluster. The pod comes with an immediately usable installation of TensorFlow 2. Note that defining resource requests and limits is now mandatory.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: gpu-container
    image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
    command: ["sleep", "1d"]
    resources:
      requests:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi
      limits:
        cpu: 1
        nvidia.com/gpu: 1
        memory: 10Gi
    volumeMounts:
      - mountPath: /abyss/home
        name: cephfs-home
        readOnly: false
      - mountPath: /abyss/shared
        name: cephfs-shared
        readOnly: false
      - mountPath: /abyss/datasets
        name: cephfs-datasets
        readOnly: true
  volumes:
    - name: cephfs-home
      hostPath:
        path: "/cephfs/abyss/home/<username>"
        type: Directory
    - name: cephfs-shared
      hostPath:
        path: "/cephfs/abyss/shared"
        type: Directory
    - name: cephfs-datasets
      hostPath:
        path: "/cephfs/abyss/datasets"
        type: Directory

Please note (very important): Version 20.09 of the container images on nvcr.io works on all hosts in the cluster. While there are newer images available, they require drivers >= 455, which are not yet available on all machines. So please stick to 20.09 unless you target a very specific host. I will provide a table with driver versions for all hosts once they are upgraded and moved to the new cluster.

You can again switch to a shell in the container and verify GPU capabilities:

> kubectl exec -it gpu-pod -- /bin/bash
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      Off  | 00000000:C1:00.0 Off |                    0 |
| N/A   27C    P0    51W / 400W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+


Combined with the volume mounts above, you already have a working environment. For example, you could transfer some code and data of yours to your home directory and run it in interactive mode in the container as a quick test. Remember to adjust paths to datasets or to mount the directories in the locations expected by your code.

> kubectl exec -it gpu-pod -- /bin/bash
# cd /abyss/home/<your-code-repo>
# python ./main.py

Note that there are timeouts in place - this is a demo pod which runs only for 24 hours, and an interactive session also has a time limit, so it is better to build a custom run script which is executed when the container in the pod starts. A job is a wrapper for a pod spec which can, for example, make sure that the pod is restarted until it has at least one successful completion. This is useful for long deep learning workloads, where a pod failure might happen in between (for example due to a node reboot). See Kubernetes docs for pods or jobs for more details.
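As a sketch (not tested on this cluster; the script path is a placeholder), a minimal Job wrapping the GPU pod spec might look like this - add the volumeMounts and volumes from the pod example above as needed:

```yaml
# Sketch of a Job wrapping a pod spec; the run script path is a placeholder.
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-training-job
spec:
  backoffLimit: 4            # retry the pod up to 4 times on failure
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: gpu-container
        image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
        command: ["python", "/abyss/home/<your-code-repo>/main.py"]
        resources:
          requests:
            cpu: 1
            nvidia.com/gpu: 1
            memory: 10Gi
          limits:
            cpu: 1
            nvidia.com/gpu: 1
            memory: 10Gi
```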

If you do not have your code ready, you can quickly test whether GPU execution works by running demo code from this tutorial as follows:

> kubectl exec -it gpu-pod -- /bin/bash
# cd /abyss/home
# git clone https://github.com/dragen1860/TensorFlow-2.x-Tutorials.git
# cd TensorFlow-2.x-Tutorials/12_VAE
# ls
README.md  images  main.py  variational_autoencoder.png
# pip3 install pillow matplotlib
# python ./main.py

Cleaning up

Once everything works for you on the new cluster, please clean up your presence on the old one.

In particular:

  • Delete all running pods
  • Delete all persistent volume claims. This is the most important step, as it shows me which of the local filesystems of the nodes are not in use anymore, so I can transfer the node over to the new cluster.