Changes

CCU:New GPU Cluster

2,717 bytes removed, 4 years ago
We first provide a comprehensive list of changes in how to use the cluster, then give a detailed manual for how to move over your data and pods.
== Pod configuration on the new cluster ==
=== User namespace, pod security and quotas ===
Each user now works in their own namespace, which is auto-generated when your login is created. The naming convention is "firstname-lastname", i.e. you replace every '.' in your cluster username with '-'. You therefore need to update the default namespace in your kubeconfig. For security reasons, containers must run with your user id and your user group. To make configuration easy, a pod preset that sets all required options (in addition to mounting basic filesystems) is available in your namespace; see the examples below for details.
Finally, there is now a mechanism in place to set resource quotas for individual users. The preset is quite generous at the moment since we have plenty of resources, but if you believe your account is too limited, please contact me.
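The namespace convention and the user/group requirement described above can be illustrated with a small sketch. It assumes nothing about the actual pod preset: the username <code>jane.doe</code>, the image <code>ubuntu:18.04</code>, the pod name, and the ids <code>1000</code> are hypothetical placeholders, and the helper names are made up for this example. The <code>runAsUser</code> and <code>runAsGroup</code> fields are the standard Kubernetes pod security context fields; if the preset in your namespace already sets them, you may not need to write them yourself.

```python
import json


def namespace_from_username(username: str) -> str:
    """Apply the naming convention: every '.' in the username becomes '-'."""
    return username.replace(".", "-")


def pod_manifest(username: str, image: str, uid: int, gid: int) -> dict:
    """Build a minimal pod manifest that runs with the given user and group id.

    All argument values are supplied by the caller; nothing here is read
    from the cluster, so treat the result as a template to adapt.
    """
    namespace = namespace_from_username(username)
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"{namespace}-test", "namespace": namespace},
        "spec": {
            # Standard pod-level security context: run as your own uid/gid.
            "securityContext": {"runAsUser": uid, "runAsGroup": gid},
            "containers": [{"name": "main", "image": image}],
            "restartPolicy": "Never",
        },
    }


if __name__ == "__main__":
    # Hypothetical user and ids; replace them with your own values.
    print(json.dumps(pod_manifest("jane.doe", "ubuntu:18.04", 1000, 1000), indent=2))
```

The printed JSON can be saved to a file and used as a starting point for a pod definition. To change the default namespace of your current kubeconfig context, <code>kubectl config set-context --current --namespace=firstname-lastname</code> is the usual command.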
=== Persistent volume management (or lack thereof) ===
The Ceph storage cluster provides a file system which is mounted on every node in the cluster. Pods are allowed to mount a subset of this filesystem as a host path. This happens automatically if you use the preconfigured pod preset in your namespace (see below). The following directories will be mounted in each of your containers:
* /abyss/shared: a shared directory where every user has read/write access. It's a standard unix filesystem, and everyone has an individual user id but is (for now) in the same user group. Thus, you can set the usual file access permissions for directories you create. To avoid total anarchy in this filesystem, please choose sensible names and organize your files in subdirectories. For example, put personal files which you want to make accessible to everyone in "/abyss/shared/users/<your-namespace>". I will monitor how it works out and whether we need more rules here.
* /abyss/datasets: directory for static datasets, mounted read-only. These are large general-interest datasets for which we only want to store one copy on the filesystem (no separate imagenets for everyone, please). So whenever you have a well-known public dataset in your shared directory which you think would be useful to have in the static tree, please contact me and I will move it to the read-only region.
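The directory layout described in the list above can be captured in a few lines of path handling. This is only a sketch: the two root paths come from the list, while the helper names <code>shared_user_dir</code> and <code>is_read_only</code> and the example namespace and dataset path are hypothetical. The code works purely on path strings and does not touch the filesystem.

```python
from pathlib import PurePosixPath

# Mount points as listed above.
SHARED_ROOT = PurePosixPath("/abyss/shared")
DATASETS_ROOT = PurePosixPath("/abyss/datasets")


def shared_user_dir(namespace: str) -> PurePosixPath:
    """Return the suggested location for personal files shared with everyone."""
    return SHARED_ROOT / "users" / namespace


def is_read_only(path: str) -> bool:
    """Return True if the path lies inside the read-only datasets tree."""
    p = PurePosixPath(path)
    return p == DATASETS_ROOT or DATASETS_ROOT in p.parents


if __name__ == "__main__":
    # Hypothetical namespace; use your own "firstname-lastname".
    print(shared_user_dir("jane-doe"))
    print(is_read_only("/abyss/datasets/example/train"))
```

A check like <code>is_read_only</code> can be useful in training scripts to fail early when an output directory is accidentally pointed at the datasets tree.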
 
== Getting started on the new cluster ==
=== Login to the new cluster and update your kubeconfig ===
The frontend for the cluster and login services is located here:
Please follow instructions there to obtain credentials and cluster data for your kubeconfig.
=== Running the first test container on the new cluster ===

=== Moving your workloads to the new cluster ===

== Compute nodes ==
See [[Cluster:Compute_nodes|this page]] for a current list of compute nodes, their hardware, and which groups they serve.

== What you need ==
* An account for the CCU.
* Ideally, a desktop PC with an nVidia GPU to test your code before pushing it to the cluster. However, you can develop for and control the cluster on any machine, it's not mandatory that you can actually run the code locally. Note, however, that it makes debugging harder if you cannot do this (you have to do everything on the console).
* Your PC ideally runs a flavor of Linux, all example scripts were tested against Ubuntu 18.04 (should also work on derivatives, such as Mint 19). If you use Windows, you are on your own.
* Admin access to your own PC to install lots of stuff (or a friendly administrator).
* More specific needs will be covered in the in-depth tutorials.

== How to get started ==
* Preparing your system
** Step 1: [[Tutorials:Install nVidia CUDA and GPU drivers|Install nVidia CUDA and GPU drivers]]
** Step 2: [[Tutorials:Install the nVidia docker system|Install the nVidia docker system]]
** Step 3: [[Tutorials:Link to container registry on our server|Link to container registry on our server]]
** For the impatient: [[Tutorials:Complete install script for a fresh Ubuntu 18.04|Complete install script for a fresh Ubuntu 18.04]]
* Learning the basics of Docker
** An in-depth look at a [[Example:container which trains MNIST using Tensorflow|container which trains MNIST using Tensorflow]], with the following steps:
*** Step 1: create a local python tensorflow application.
*** Step 2: wrap the application in a container.
*** Step 3: run and test the container locally.
*** Step 4: push the container to the registry server of the cluster.
*** Step 5: remarks on persistent storage in docker containers
* Learning the basics of Kubernetes and how to run jobs on the cluster:
** Step 1: [[Tutorials:Install the Kubernetes infrastructure|Install the Kubernetes infrastructure]]
** Step 2: [[Tutorials:Set up your Kubernetes user account|Set up your Kubernetes user account]]
** Step 3: [[Tutorials:Run the example container on the cluster|Run the example container on the cluster]] and make sure that it works correctly.
** Step 4: [[Tutorials:Persistent volumes on the Kubernetes cluster|Persistent volumes on the GPU cluster]]
** Step 5: [[Tutorials:Monitoring with Tensorboard on the GPU cluster|Monitoring with Tensorboard on the GPU cluster]]

== Tips and Tricks ==
* [[Tips:How to ensure your pod ends up on a specific node|How to ensure your pod ends up on a specific node]]

== Reference documents ==
* [[Cluster:Namespaces|Which namespaces am I allowed to use?]]
* [[Cluster:Nodes|Which compute nodes are available?]]
* [[Cluster:Nodes|Which namespace has access to which compute nodes?]]
