== Log in to the cluster and configure kubectl ==
You first need a working version of kubectl on your system. The cluster runs Kubernetes 1.20.12, and your version of kubectl should match. See the installation instructions in the [https://kubernetes.io/docs/tasks/tools/install-kubectl/ official Kubernetes documentation].
The login page to the cluster is [https://ccu-k8s.inf.uni-konstanz.de here]. Enter your credentials and you will get back an authorization token. Click on "full kubeconfig" on the left and copy its content to a new file named ".kube/config" in your home directory. Note that the default namespace still has the template name "user-<firstname>-<lastname>"; replace this text with your username.
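For orientation, the relevant part of the kubeconfig might look roughly like the following sketch after the change (the cluster and context names are assumptions; use whatever the downloaded file actually contains and only adjust the namespace):
<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Config
contexts:
- context:
    cluster: ccu-k8s                 # name as given in the downloaded file
    namespace: <your-username>       # was: user-<firstname>-<lastname>
    user: <your-username>
  name: <your-username>@ccu-k8s
current-context: <your-username>@ccu-k8s
</syntaxhighlight>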
== Create a pod to access the file systems ==
After logging in and adjusting the kubeconfig to the new cluster and your user namespace, you should be able to start your first pod. Create a work directory on your machine and a file "accessubuntu-test-pod.yaml" describing a simple pod that mounts the cluster file systems.
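The original manifest is not reproduced here; a minimal sketch might look like the following (the image, the sleep command, and the hostPath mount are assumptions based on the file-system layout described below):
<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: accessubuntu-test-pod
spec:
  containers:
  - name: ubuntu
    image: ubuntu:20.04
    command: ["sleep", "infinity"]   # keep the pod alive for interactive use
    volumeMounts:
    - name: abyss
      mountPath: /abyss
  volumes:
  - name: abyss
    hostPath:
      path: /cephfs/abyss
      type: Directory
</syntaxhighlight>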
Then start the pod and check that it is running:
<syntaxhighlight lang="bash">
> kubectl apply -f accessubuntu-test-pod.yaml
> kubectl get pods
> kubectl describe pod accessubuntu-test-pod
</syntaxhighlight>
Once the pod is running, you can open a shell inside it and look around:
<syntaxhighlight lang="bash">
> kubectl exec -it accessubuntu-test-pod -- /bin/bash
# cd /abyss/home/
# ls
</syntaxhighlight>
To copy files from your own machine into the pod, use:
<syntaxhighlight lang="bash">
> kubectl cp <my-files> accessubuntu-test-pod:/abyss/home/
</syntaxhighlight>
This also works in the other direction, to get files out of the pod. kubectl is a powerful and complex tool; for more ideas of what you can do with it, refer to the basic [https://kubernetes.io/docs/reference/kubectl/cheatsheet/ kubectl cheat sheet] or a more [https://github.com/dennyzhang/cheatsheet-kubernetes-A4 advanced version here].
The following directories are available on the cluster file system:
* '''/cephfs/abyss/home/<your-username>''': this is your personal home directory which you can use any way you like.
* '''/cephfs/abyss/shared''': a shared directory where every user has read/write access. It is a standard unix filesystem; every user has an individual user id, but (for now) everyone is in the same user group, so your data is not secure here from manipulation or deletion. You can set the permissions for files and directories you create accordingly, to restrict or allow access. To avoid total anarchy in this filesystem, please use sensible names and organize your data in subdirectories. For example, put personal files which you want to make accessible to everyone in "/abyss/shared/users/<username>". Be considerate towards other users. I will monitor how it works out and whether we need more rules here. If you need more private storage shared only between the members of a trusted work group, please contact me.
* '''/cephfs/abyss/datasets''': directory for static datasets, mounted read-only. These are large general-interest datasets of which we only want to store one copy on the filesystem (no separate imagenets for everyone, please). Whenever you have a well-known public dataset in your shared directory which you think would be useful to have in the static tree, please contact me and I will move it to the read-only region.
In addition, you can use a directory local to each host, which, depending on your workload, might be much faster than cephfs but also ties you to a specific machine:
* '''/raid/local-data/<your-username>''': your personal directory on the local SSD raid of the machine. Make sure to set "type: DirectoryOrCreate", as it is not guaranteed to exist yet.
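A volume definition for this local directory might look as follows in your pod manifest (a sketch; the volume name and mount path are arbitrary choices):
<syntaxhighlight lang="yaml">
spec:
  containers:
  - name: main
    # ...
    volumeMounts:
    - name: local-data
      mountPath: /local-data
  volumes:
  - name: local-data
    hostPath:
      path: /raid/local-data/<your-username>
      type: DirectoryOrCreate    # create the directory if it does not exist yet
</syntaxhighlight>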
Please refer to [[CCU:Perstistent storage on the Kubernetes cluster|the persistent storage documentation]] for more details.
== Running actual workloads on the cluster ==
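The rest of this section refers to a manifest "gpu-pod.yaml". A minimal sketch might look like the following (the concrete image tag and the sleep command are assumptions; the essential part is requesting a GPU via the resource limit):
<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
  - name: tensorflow
    image: nvcr.io/nvidia/tensorflow:20.09-tf2-py3
    command: ["sleep", "infinity"]   # keep the pod alive for interactive use
    resources:
      limits:
        nvidia.com/gpu: 1            # request one GPU
</syntaxhighlight>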
See [https://www.nvidia.com/en-us/gpu-cloud/containers/ the catalog of containers by nVidia] for more options for base images (e.g. [https://ngc.nvidia.com/catalog/containers/nvidia:pytorch PyTorch]), or Google around for containers of your favourite application. '''Make sure you only run containers from trusted sources!'''
'''Please note (very important): the 20.09 versions of the deep learning framework container images on nvcr.io work on all hosts in the cluster. Newer images require drivers >= 455, which are not yet available on all machines. For guaranteed compatibility, stick to 20.09 unless you target a specific host with newer drivers.''' I will soon provide a table with driver versions for all hosts once they are upgraded and moved to the new cluster. As a general rule, everything which is built for CUDA 11.0 and driver version >= 450 should work fine on the cluster.
At the bottom of the GPU cluster status page, you find the nvidia-smi output for each node, where you can check the individual driver and CUDA versions. You can also switch to a shell in the container and verify its GPU capabilities:
<syntaxhighlight lang="bash">
> kubectl apply -f gpu-pod.yaml
</syntaxhighlight>
Wait until the pod is created (check with "kubectl describe pod gpu-pod" or "kubectl get pods"), then:
<syntaxhighlight lang="bash">
> kubectl exec -it gpu-pod -- /bin/bash
# nvidia-smi
</syntaxhighlight>
The nvidia-smi output should show the driver version, the CUDA version, and the GPU assigned to the pod.
To check compatibility with specific nVidia containers, please refer to the [https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html official compatibility matrix]. Note that all nodes have datacenter drivers installed, which should give a large amount of compatibility. If in doubt, just try it out.
Combine this with the volume mounts above and you already have a working environment. For example, you could transfer some of your code and data to your home directory and run it interactively in the container as a quick test. Remember to adjust paths to data sets, or to mount the directories in the locations expected by your code.
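As a sketch of that workflow (the script name "train.py" and the paths are hypothetical; substitute your own):
<syntaxhighlight lang="bash">
> kubectl cp train.py gpu-pod:/abyss/home/<your-username>/
> kubectl exec -it gpu-pod -- /bin/bash
# cd /abyss/home/<your-username>
# python train.py
</syntaxhighlight>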
Remember to clean up resources which you are not using anymore; this includes pods and jobs. For example, when your pod has finished whatever it is supposed to be doing, run
<syntaxhighlight lang="bash">
> kubectl delete -f gpu-pod.yaml
</syntaxhighlight>
using the same manifest file you used to create the resource with kubectl apply.
== Targeting specific nodes and GPU capabilities ==
By default, your pods will be scheduled on the lowest class of GPUs (in terms of available memory; they are mostly still quite decent). Please refer to
[[Cluster:Compute nodes|the documentation on compute nodes]] for information on how to target nodes with higher capabilities.
== Accessing ports on the pod from your own system ==
Some monitoring tools for deep learning use ports on the pod to convey information via a browser interface, an example being Tensorboard. You can forward these ports to your own local host using kubectl as a proxy. Follow the [https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/ tutorial here] to learn how it works. The syntax for port-forwarding is:
<syntaxhighlight lang="bash">
> kubectl port-forward <pod-name> <dest-port>:<source-port>
</syntaxhighlight>
kubectl will now keep running as a proxy. While it is running, you can access the pod's service in the browser on your own machine at "localhost:<dest-port>". You could even create containers which provide interactive environments via a web interface, e.g. a Jupyter notebook server.

== Create, push and pull docker images to and from the CCU repository ==
Please follow our tutorial on how to create, push and pull docker images to and from our CCU repository:
* [[Tutorials:Link_to_container_registry_on_our_server | How to use the CCU image repository]]

== Mount your custom, or Data Management Plan (DMP) provided, cifs storage ==
* [[Tutorials:Mount_cifs_storage_in_a_pod | How to mount cifs storage]]