== Log in to the cluster and configure kubectl ==
You first need a working version of kubectl on your system. The cluster runs Kubernetes 1.20, and your kubectl version should match it (kubectl is supported within one minor version of the cluster). See the installation instructions in the [https://kubernetes.io/docs/tasks/tools/install-kubectl/ official Kubernetes documentation].
The login page for the cluster is [https://ccu-k8s.inf.uni-konstanz.de here]. Enter your credentials and you will receive an authorization token. Click on "full kubeconfig" on the left and copy its content into a new file named ".kube/config" in your home directory. Note that the default namespace still contains the template name "user-<firstname>-<lastname>". Replace this text with your actual username, so that your kubeconfig looks like this:
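For illustration, the relevant part of the kubeconfig might look like the excerpt below (cluster name, user name and layout are placeholders; the generated file will differ in detail, and only the namespace field needs editing):
<syntaxhighlight lang="yaml">
# Excerpt of ~/.kube/config -- illustrative only; keep the rest of the
# generated file unchanged.
contexts:
- context:
    cluster: ccu-k8s
    namespace: user-jane-doe   # <- replace the template with your username
    user: jane-doe
  name: ccu-k8s
current-context: ccu-k8s
</syntaxhighlight>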
* '''/cephfs/abyss/shared''': a shared directory where every user has read/write access, so your data here is not protected from manipulation or deletion. To avoid total anarchy in this filesystem, please use sensible names and organize your data in subdirectories. For example, put personal files which you want to make accessible to everyone in "/abyss/shared/users/<username>". Be considerate towards other users. I will monitor how it works out and whether we need more rules here. If you need more private storage shared only between the members of a trusted work group, please contact me.
* '''/cephfs/abyss/datasets''': directory for static datasets, mounted read-only. These are large general-interest datasets for which we only want to store one copy on the filesystem (no separate ImageNets for everyone, please). So whenever you have a well-known public dataset in your shared directory which you think is useful to have in the static tree, please contact me and I will move it to the read-only region.
In addition, you can use a directory local to each host, which depending on your workload might be much faster than cephfs, but also ties you to a specific machine:
* '''/raid/local-data/<your-username>''': your personal directory on the local SSD raid of the machine. Make sure to set "type: DirectoryOrCreate", as it is not guaranteed to exist yet.
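As a sketch, the corresponding volume definition in a pod manifest could look like this (volume name, container name and mount path are placeholders; keep the "<your-username>" part pointing at your own directory):
<syntaxhighlight lang="yaml">
# Illustrative pod spec fragment -- adjust names and paths to your setup.
spec:
  volumes:
  - name: local-scratch
    hostPath:
      path: /raid/local-data/<your-username>
      type: DirectoryOrCreate   # create the directory if it does not exist yet
  containers:
  - name: main
    volumeMounts:
    - name: local-scratch
      mountPath: /scratch
</syntaxhighlight>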
Please refer to [[CCU:Persistent storage on the Kubernetes cluster|the persistent storage documentation]] for more details.
== Running actual workloads on the cluster ==
See [https://www.nvidia.com/en-us/gpu-cloud/containers/ the catalog of containers by nVidia] for more options for base images (e.g. [https://ngc.nvidia.com/catalog/containers/nvidia:pytorch PyTorch]), or Google around for containers of your favourite application. '''Make sure you only run containers from trusted sources!'''
'''Please note (very important): The 20.09 versions of the deep learning frameworks on nvcr.io work on all hosts in the cluster. Newer images are available, but they require drivers >= 455, which are not yet available on all machines. For guaranteed compatibility, stick to 20.09 unless you target a specific host with newer drivers.''' I will provide a table with driver versions for all hosts once they are upgraded and moved to the new cluster. As a general rule, everything built for CUDA 11.0 and driver version >= 450 should work fine on the cluster. Older images on nvcr.io which run e.g. CUDA 10.2 also still work, in case your code requires an older CUDA version.
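As an illustration, a minimal manifest for a GPU pod based on a 20.09 image might look like the following (the pod name, container name and image tag are assumptions; pick the image for your framework from the nvcr.io catalog):
<syntaxhighlight lang="yaml">
# gpu-pod.yaml -- minimal sketch of a pod requesting one GPU.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
  - name: pytorch
    image: nvcr.io/nvidia/pytorch:20.09-py3
    command: ["sleep", "infinity"]   # keep the pod alive for interactive use
    resources:
      limits:
        nvidia.com/gpu: 1            # number of GPUs to allocate
</syntaxhighlight>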
At the bottom of the GPU cluster status page, there is the nvidia-smi output for each node, where you can check the individual driver and CUDA versions. You can also switch to a shell in the container and verify its GPU capabilities:
<syntaxhighlight>
> kubectl apply -f gpu-pod.yaml
... wait until pod is created, check with "kubectl describe pod gpu-pod" or "kubectl get pods"
> kubectl exec -it gpu-pod -- /bin/bash
# nvidia-smi
... (nvidia-smi output: driver version, CUDA version and per-GPU table) ...
</syntaxhighlight>
To check compatibility with specific nVidia containers, please refer to the [https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html official compatibility matrix]. Note that all nodes have datacenter drivers installed, which should provide broad compatibility. If in doubt, just try it out.
Combine this with the volume mounts above and you already have a working environment. For example, you could transfer some of your code and data to your home directory and run it interactively in the container as a quick test. Remember to adjust paths to data sets, or to mount the directories in the locations expected by your code. When you are done, delete the pod with "kubectl delete -f", using the same manifest file you used to create the resource with kubectl apply.
== Targeting specific nodes and GPU capabilities ==
By default, your pods will be scheduled on the lowest class of GPUs (in terms of available memory; they are mostly still quite decent). Please refer to
[[Cluster:Compute nodes|the documentation on compute nodes]] for information on how to target different nodes with higher capabilities.
== Accessing ports on the pod from your own system ==
kubectl will now keep running as a proxy. While it is running, you can access the pod's service at "localhost:<dest-port>" in a browser on your own machine. You can even create containers which provide interactive environments via a web interface, e.g. a Jupyter notebook server.
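A typical invocation looks like this (the pod name and the remote port 8888 are placeholders; "<dest-port>" is the local port on your machine, kept elided as above):
<syntaxhighlight>
> kubectl port-forward pod/gpu-pod <dest-port>:8888
Forwarding from 127.0.0.1:<dest-port> -> 8888
</syntaxhighlight>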
== Create, push and pull docker images to and from the CCU repository ==
Please follow our tutorial on how to create, push and pull docker images to and from our CCU repository:
* [[Tutorials:Link_to_container_registry_on_our_server | How to use the CCU image repository]]
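In brief, the workflow follows the standard docker login/tag/push/pull pattern; the commands below are a sketch where "<registry-host>" is a placeholder for the actual repository address given in the tutorial:
<syntaxhighlight>
> docker login <registry-host>
> docker tag my-image:latest <registry-host>/<username>/my-image:latest
> docker push <registry-host>/<username>/my-image:latest
> docker pull <registry-host>/<username>/my-image:latest
</syntaxhighlight>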
== Mount your custom or Data Management Plan (DMP) provided CIFS storage ==
* [[Tutorials:Mount_cifs_storage_in_a_pod | How to mount cifs storage]]