CCU:GPU Cluster Quick Start

== Running actual workloads on the cluster ==
See [https://www.nvidia.com/en-us/gpu-cloud/containers/ the NVIDIA container catalog] for more base-image options (e.g. [https://ngc.nvidia.com/catalog/containers/nvidia:pytorch PyTorch]), or search for containers of your favourite application. '''Make sure you only run containers from trusted sources!'''
'''Please note (very important): the 20.09 versions of the deep learning frameworks on nvcr.io work on all hosts in the cluster. Newer images are available, but they require drivers >= 455, which are not yet installed on all machines. For guaranteed compatibility, stick to 20.09 unless you target a specific host with newer drivers.''' I will soon provide a table with driver versions for all hosts once they are upgraded and moved to the new cluster. As a general rule, anything built for CUDA 11.0 and driver version >= 450 should work fine on the cluster. Older images on nvcr.io which use, for example, CUDA 10.2 also still work, in case your code requires an older CUDA version.
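As a minimal sketch, a pod spec pinning one of the 20.09 images might look like the following (the pod name, container name and command are placeholders for illustration, not cluster conventions):

<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-test            # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: pytorch
    image: nvcr.io/nvidia/pytorch:20.09-py3   # 20.09 works on all hosts
    command: ["sleep", "infinity"]            # keep the pod alive for interactive use
    resources:
      limits:
        nvidia.com/gpu: 1                     # request one GPU
</syntaxhighlight>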
At the bottom of the GPU cluster status page, you will find the nvidia-smi output for each node, where you can check the individual driver and CUDA versions. You can also switch to a shell in the container and verify its GPU capabilities:
<syntaxhighlight>
+-------------------------------+----------------------+----------------------+
</syntaxhighlight>
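For example, assuming a running pod named <code>pytorch-test</code> (a placeholder), you could open a shell in it and run nvidia-smi yourself:

<syntaxhighlight lang="bash">
# Open an interactive shell in the pod (pod name is a placeholder)
kubectl exec -it pytorch-test -- /bin/bash

# Inside the container: show driver version, CUDA version and visible GPUs
nvidia-smi
</syntaxhighlight>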
 
To check compatibility with specific NVIDIA containers, please refer to the [https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html official compatibility matrix]. Note that all nodes have datacenter drivers installed, which provide broad compatibility. If in doubt, just try it out.
Combine this with the volume mounts above and you already have a working environment. For example, you could transfer some of your code and data to your home directory and run it in interactive mode in the container as a quick test. Remember to adjust the paths to your data sets, or to mount the directories in the locations your code expects.
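One way to get files into a running pod for such a quick test is <code>kubectl cp</code>; the pod name and paths below are illustrative placeholders, not fixed cluster conventions:

<syntaxhighlight lang="bash">
# Copy code and data into the running pod (names and paths are illustrative)
kubectl cp ./my_experiment pytorch-test:/workspace/my_experiment

# Then run it interactively inside the container
kubectl exec -it pytorch-test -- python /workspace/my_experiment/train.py
</syntaxhighlight>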
kubectl now keeps running as a proxy. While it is running, you can access the pod's service at "localhost:&lt;dest-port&gt;" in the browser on your own machine. You could even create containers that provide interactive environments via a web interface, e.g. a Jupyter notebook server.
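As an illustrative sketch (the pod name and port numbers are placeholders), forwarding a Jupyter server's default port might look like:

<syntaxhighlight lang="bash">
# Forward local port 8888 to port 8888 in the pod; Ctrl-C stops the proxy
kubectl port-forward pytorch-test 8888:8888

# Then open http://localhost:8888 in your browser
</syntaxhighlight>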
 
== Create, push and pull Docker images to and from the CCU repository ==
 
Please follow our tutorial on creating, pushing and pulling Docker images to and from our CCU repository:
 
* [[Tutorials:Link_to_container_registry_on_our_server | How to use the CCU image repository]]
 
== Mount your custom or Data Management Plan (DMP) provided CIFS storage ==
 
* [[Tutorials:Mount_cifs_storage_in_a_pod | How to mount cifs storage]]
