CCU:GPU Cluster
Latest revision as of 08:08, 4 July 2024
Overview
The CCU provides access to state-of-the-art hardware infrastructure to run GPU-accelerated machine learning applications. This page gives a general overview and links to more in-depth tutorials on how to work with the cluster. There is some overhead involved when writing code for your projects and you have to stick to a few guidelines, but there are template projects and scripts provided so that you can get started with minimal knowledge about the technical background of the GPU cluster.
The GPU cluster is based on Kubernetes, which is a framework to deploy so-called Docker containers to different compute nodes. You can think of a Docker container as a wrapper for your machine learning application, which contains all necessary code and all the libraries it depends on (yes, also the ones from the basic OS). In essence, it is an independent object which can be deployed and run on an arbitrary computer on which the docker infrastructure is installed.
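As an illustration of what "wrapping your application in a container" means, a minimal Dockerfile for a Python machine learning application might look like the sketch below. The base image is a real public TensorFlow image, but the file names (`train.py`, `requirements.txt`) are placeholders for your own project:

```dockerfile
# Start from a GPU-enabled base image so CUDA libraries are already included
FROM tensorflow/tensorflow:latest-gpu

WORKDIR /app

# Install your Python dependencies first (cached between builds)
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy your application code and define what runs when the container starts
COPY train.py .
CMD ["python", "train.py"]
```

Building this file with `docker build` produces a self-contained image that runs the same way on your desktop and on a cluster node.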
This means that you have to be able to take the necessary steps to wrap your own code into a container. All of this is covered in an easy, introductory way in the short tutorials below, which should be sufficient to get you started. At some point, you might want to learn about Docker in more depth; for this, I refer you to the excellent tutorials available elsewhere, some of which are linked here.
Compute nodes
See this page for a current list of compute nodes, their hardware, and which groups they serve.
Changelog
See this page for a history of cluster updates and necessary changes on your side.
What you need
- An account for the CCU.
- Ideally, a desktop PC with an nVidia GPU to test your code before pushing it to the cluster. You can develop for and control the cluster from any machine; it is not mandatory to run the code locally. Note, however, that debugging is harder if you cannot (you have to do everything on the console).
- Your PC ideally runs a flavor of Linux; all example scripts were tested against Ubuntu 20.04 (they should also work on derivatives of that Ubuntu edition, such as Linux Mint). If you use Windows, you are on your own.
- Admin access to your own PC to install lots of stuff (or a friendly administrator).
- More specific needs will be covered in the in-depth tutorials.
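Before diving into the tutorials, you can sanity-check your local machine with a short shell snippet. This check is a convenience sketch, not part of the official tutorials; it only reports whether the NVIDIA driver and Docker are installed, without failing when they are missing:

```shell
# Report whether the NVIDIA driver tool and Docker are installed locally.
# Missing tools are reported rather than treated as errors.
gpu_status=$(command -v nvidia-smi >/dev/null 2>&1 && echo "found" || echo "missing")
docker_status=$(command -v docker >/dev/null 2>&1 && echo "found" || echo "missing")
echo "NVIDIA driver (nvidia-smi): $gpu_status"
echo "Docker:                     $docker_status"
```

If either line reports "missing", the corresponding install tutorial below is where to start.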
How to get started
Most up-to-date information for the current cluster:
- Quick start tutorial
- How to use persistent storage
- How to use the CCU image repository
- How to mount CIFS storage
- How to target different compute nodes
The following information is partially outdated and refers to older system setups (Ubuntu 18.04). In particular, the install scripts probably no longer work. Instead, refer to the current online documentation on how to install e.g. the NVIDIA drivers and nvidia-docker.
- Preparing your system
- Step 1: Install nVidia CUDA and GPU drivers
- Step 2: Install the nVidia docker system
- For the impatient: Complete install script for a fresh Ubuntu 18.04
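For orientation, the current approach on a recent Ubuntu roughly follows the sketch below. This is an assumption-laden outline, not a tested script; the exact repository setup and image tag change over time, so check NVIDIA's current install guide before running any of it:

```shell
# Install the recommended NVIDIA driver (Ubuntu's driver helper)
sudo ubuntu-drivers autoinstall

# Install the NVIDIA Container Toolkit (requires NVIDIA's apt repository
# to be configured first -- see NVIDIA's current install guide)
sudo apt-get install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify that containers can see the GPU (image tag is an example)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

If the final command prints your GPU in the `nvidia-smi` table, the local container setup works.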
These tutorials should still work:
- Learning the basics of Docker (requires docker or nvidia-docker for the GPU containers)
- Link to container registry on our server
- An in-depth look at a container which trains MNIST using Tensorflow, with the following steps:
- Step 1: create a local python tensorflow application.
- Step 2: wrap the application in a container.
- Step 3: run and test the container locally.
- Step 4: push the container to the registry server of the cluster.
- Step 5: remarks on persistent storage in docker containers
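Steps 2 through 4 above boil down to a short command sequence. The image name and registry host below are placeholders (substitute your own project name and the actual CCU registry address from the registry tutorial):

```shell
# Step 2: build the image from the Dockerfile in the current directory
docker build -t mnist-train .

# Step 3: run and test the container locally, with GPU access
docker run --rm --gpus all mnist-train

# Step 4: tag the image for the cluster registry and push it
# (registry.example.org and USER are hypothetical placeholders)
docker tag mnist-train registry.example.org/USER/mnist-train:v1
docker push registry.example.org/USER/mnist-train:v1
```

Pushing usually requires a prior `docker login` against the registry, which is covered in the registry tutorial.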
- Learning the basics of Kubernetes and how to run jobs on the cluster:
- Step 1: Install the Kubernetes infrastructure
- Step 2: Set up your Kubernetes user account
- Step 3: Run the example container on the cluster and make sure that it works correctly.
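Once your account is set up, a cluster job is described by a small YAML manifest. The sketch below is a minimal, hypothetical GPU job; the image name is a placeholder, and your cluster may require additional labels or node selectors (see the compute-nodes page):

```yaml
# Minimal sketch of a one-off GPU job (names are placeholders)
apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-train
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: mnist-train
          image: registry.example.org/USER/mnist-train:v1
          resources:
            limits:
              nvidia.com/gpu: 1   # request one GPU from the scheduler
```

You would submit this with `kubectl apply -f job.yaml` and follow the output with `kubectl logs job/mnist-train`.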