CCU:GPU Cluster Quick Start

> kubectl get pods
> kubectl describe pod access-pod
> kubectl exec -it access-pod -- /bin/bash
# cd /abyss/home/
# ls
# (this directory might already contain data which was automatically copied over from volumes on the old cluster)
</syntaxhighlight>
From within the container, you have access to the internet, can install packages which are still missing, and also copy over your code and data via rsync or pulling it with e.g. git or svn. You can also push stuff into the container from your local machine using kubectl.
<syntaxhighlight lang="bash">
> kubectl cp <my-files> access-pod:/abyss/home/
</syntaxhighlight>

== Pod configuration on the new cluster ==

=== User namespace, pod security and quotas ===

Each user works in their own namespace now, which is auto-generated when your login is created. The naming convention is as follows:
* Login ID : firstname.lastname
* Username : firstname-lastname
* Namespace: user-firstname-lastname
That means you replace all '.'s in your login ID with a '-' to obtain the username, and prepend "user-" to obtain the namespace. Thus, you should set your default namespace in the kubeconfig accordingly, and perhaps have to update pod configurations.

For security reasons, containers are forced to run with your own user id and a group id of "10000". These will also be the ids used to create files and directories, and they decide the permissions you have on the file system. The security policy which is active for your namespace will automatically fill in this data. Note that the security policy for pods is very restrictive for now to detect all problematic cases. In particular, you can not switch to root inside containers anymore. Please inform me if security policies disrupt your usual workflow so that we can work something out.

Finally, there is now a mechanism in place to set resource quotas for individual users. The preset is quite generous at the moment since we have plenty of resources, but if you believe your account is too limited, please contact me.
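The renaming rule above can be sketched in shell; the login ID used here is a hypothetical example, not a real account:

<syntaxhighlight lang="bash">
# derive username and namespace from a login ID (example login is hypothetical)
login="jane.doe"
username="${login//./-}"      # replace every '.' with '-'
namespace="user-${username}"  # prepend "user-"
echo "${namespace}"           # prints: user-jane-doe

# the resulting namespace can then be set as the kubeconfig default:
# kubectl config set-context --current --namespace="${namespace}"
</syntaxhighlight>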
For more ideas for what you can do with kubectl, which is a powerful and complex tool, please refer to the basic [https://kubernetes.io/docs/reference/kubectl/cheatsheet/ kubectl cheat sheet] or a more [https://github.com/dennyzhang/cheatsheet-kubernetes-A4 advanced version here].

=== Persistent volume management (or lack thereof) ===
The ceph storage cluster provides a file system which is mounted on every node in the cluster, so any file systems you are mounting into a pod are available wherever the pod is scheduled. Pods are allowed to mount a subset of the filesystem as a host path, see the example pod below. The following directories can be used by anyone:
* '''/cephfs/abyss/home/<username>''': this is your personal home directory which you can use any way you like.
* '''/cephfs/abyss/datasets''': directory for static datasets, mounted read-only. These are large general-interest datasets for which we only want to store one copy on the filesystem (no separate imagenets for everyone, please). So whenever you have a well-known public dataset in your shared directory which you think is useful to have in the static tree, please contact me and I will move it to the read-only region.
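As a sketch of such a host-path mount (the pod name and image are assumptions, not prescribed by this page), a pod mounting your home directory could look like:

<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: home-test-pod              # hypothetical name
  namespace: user-firstname-lastname
spec:
  containers:
  - name: ubuntu
    image: ubuntu:20.04
    command: ["sleep", "1d"]
    volumeMounts:
    - mountPath: /abyss/home       # path as seen inside the container
      name: cephfs-home
  volumes:
  - name: cephfs-home
    hostPath:
      path: /cephfs/abyss/home/<username>   # your personal home directory
      type: Directory
</syntaxhighlight>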
== Copy data from the old cluster into the new filesystem ==
The shared file system can be mounted as an nfs volume on the node "Vecna" on the old cluster, so you can create a pod on Vecna which mounts both the new filesystem as well as your PVs from the old cluster. Please use the following pod configuration as a template and add additional mounts for the PVs you want to copy over:
<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: <your-username>-transfer-pod
  namespace: exc-cb
spec:
  nodeSelector:
    kubernetes.io/hostname: vecna
  containers:
  - name: ubuntu
    image: ubuntu:20.04
    command: ["sleep", "1d"]
    volumeMounts:
    - mountPath: /abyss/shared
      name: cephfs-shared
      readOnly: false
  volumes:
  - name: cephfs-shared
    nfs:
      path: /cephfs/abyss/shared
      server: ccu-node1
</syntaxhighlight>

Afterwards, run a shell in the container and copy your stuff over to /abyss/shared/users/<your-username>. Make sure to set a group ownership id of 10000 with rw permissions for the group (rwx for directories) so you have read/write access on the new cluster. The following should do the trick:

<syntaxhighlight lang="bash">
> kubectl exec -it <your-username>-transfer-pod -- /bin/bash
# cd /abyss/shared/users/<your-username>
# cp -r <all-my-stuff> ./
# chgrp -R 10000 *
# chown -R 10000 *   (replace with your real user ID if you already know it from logging into the new cluster, see below)
# chmod -R g+w *
</syntaxhighlight>

== Running actual workloads on the new cluster ==

=== Moving your workloads to the new cluster ===
You can now verify that you can start a GPU-enabled pod. Try to create a pod with the following specs to allocate 1 GPU for you somewhere on the cluster.
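A minimal sketch of such a pod spec (the name and image are hypothetical; the GPU is requested via the standard `nvidia.com/gpu` device-plugin resource):

<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod               # hypothetical name
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:11.4.2-base-ubuntu20.04   # hypothetical image tag
    command: ["sleep", "1d"]
    resources:
      limits:
        nvidia.com/gpu: 1          # allocate a single GPU
</syntaxhighlight>

Once the pod is running, `kubectl exec -it gpu-test-pod -- nvidia-smi` should list the allocated GPU.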
Combine with the volume mounts above, and you already have a working environment. For example, you could transfer some code and data of yours to your home directory, and run it in interactive mode in the container as a quick test. Note that there are timeouts in place and an interactive session does not last forever, so it is better to build a custom run script which is executed when the container in the pod starts. See the documentation for more details. TODO: link to respective doc.
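A minimal run script along those lines, with hypothetical paths, which the pod's `command` could point at (e.g. `command: ["/bin/bash", "/abyss/home/<username>/run.sh"]`):

<syntaxhighlight lang="bash">
#!/bin/bash
set -e                               # abort on the first error
cd /abyss/home/<username>/project    # hypothetical project directory
python3 train.py                     # hypothetical entry point
</syntaxhighlight>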
 
=== Cleaning up ===
 
Once everything works for you on the new cluster, please clean up your presence on the old one.
 
In particular:
 
* Delete all running pods
* Delete all persistent volume claims. This is the most important step, as it shows me which of the local filesystems of the nodes are not in use anymore, so I can transfer the node over to the new cluster.
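Both cleanup steps can be done with kubectl against your namespace on the old cluster:

<syntaxhighlight lang="bash">
# delete all pods, then all persistent volume claims, in the current namespace
kubectl delete pods --all
kubectl delete pvc --all
</syntaxhighlight>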
