Tutorials:Run the example container on the cluster
Revision as of 14:58, 18 June 2019
Requirements
- A working connection and login to the Kubernetes cluster.
- A valid namespace selected with authorization to run pods.
- A test container pushed to the CCU docker registry.
Set up a Kubernetes job script
Download the Kubernetes samples and look at the job script in example_1. Alternatively, create your own directory and a file named "job_script.yaml". Edit the contents as follows, replacing all placeholders with your own data:
apiVersion: batch/v1
kind: Job
metadata:
  # name of the job
  name: tf-mnist
spec:
  template:
    spec:
      # List of containers belonging to the job starts here
      containers:
      # container name used for pod creation
      - name: tf-mnist-container
        # container image from the registry
        image: ccu.uni-konstanz.de:5000/bastian.goldluecke/tf_mnist:0.1
        # container resources requested from the node
        resources:
          # requests are minimum resource requirements
          requests:
            # this gives us a minimum of 2 GiB of main memory to work with.
            memory: "2Gi"
          # limits are maximum resource allocations
          limits:
            # this gives an absolute limit of 3 GiB of main memory.
            # exceeding it will mean the container exits immediately with an error.
            memory: "3Gi"
            # this requests a number of GPUs. GPUs will be allocated to the container
            # exclusively. No fractional GPUs can be requested.
            # When executing nvidia-smi in the container, it should show exactly this
            # number of GPUs.
            #
            # PLEASE DO NOT SET THE NUMBER TO ZERO, EVER, AND ALWAYS INCLUDE THIS LINE.
            # ALWAYS PUT IT IN THE SECTION "limits", NOT "requests".
            #
            # It is a known limitation of NVIDIA's runtime that if zero GPUs are requested,
            # then actually *all* GPUs are exposed in the container.
            # We are looking for a fix to this.
            #
            nvidia.com/gpu: "1"
        # the command which is executed after container creation
        command: ["/application/run.sh"]
      # login credentials to the docker registry.
      # for convenience, a read-only credential is provided as a secret in each namespace.
      imagePullSecrets:
      - name: registry-ro-login
      # containers will never restart
      restartPolicy: Never
  # number of retries after failure.
  # since we typically have to fix something in this case, set to zero by default.
  backoffLimit: 0
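The job script sets a memory request of 2 GiB and a memory limit of 3 GiB. Kubernetes expects a request to be no larger than the corresponding limit, so it can be worth checking the two values when you change them. The following is a minimal sketch of such a check, not part of the cluster tooling: the helper names memory_to_bytes and request_fits_limit are made up for this illustration, and only plain integers with the binary suffixes Ki, Mi, Gi and Ti are handled, which is a simplification of the full Kubernetes quantity syntax.

```python
# Binary (power-of-two) suffixes used in Kubernetes memory quantities.
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def memory_to_bytes(quantity: str) -> int:
    """Convert a quantity such as "2Gi" or "512Mi" to a number of bytes.

    Hypothetical helper: only integers with a binary suffix (or no suffix)
    are supported here.
    """
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    # No suffix: the value is interpreted as a plain byte count.
    return int(quantity)


def request_fits_limit(request: str, limit: str) -> bool:
    """Return True if the memory request does not exceed the memory limit."""
    return memory_to_bytes(request) <= memory_to_bytes(limit)


# Values taken from the job script above.
print(request_fits_limit("2Gi", "3Gi"))  # prints True
```

With the values from the job script the check passes. Swapping them (a 3Gi request with a 2Gi limit) makes request_fits_limit return False.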
When we start this job, it creates a single container from the image we previously uploaded to the registry. The container is placed on a suitable node that serves the selected namespace of the cluster.
> kubectl apply -f job_script.yaml
Checking in on the job
We first check if our container is running.
> kubectl get pods
# somewhere in the output you should see a line like this:
NAME READY STATUS RESTARTS AGE
tf-mnist-xxxx 1/1 Running 0 7s
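If you want to look up the generated pod name from a script rather than by eye, the text output of kubectl get pods can be filtered by the job name. The sketch below is only an illustration: find_job_pods is a hypothetical helper, the sample text mirrors the output shown above, and the code assumes the default column order NAME READY STATUS RESTARTS AGE.

```python
def find_job_pods(get_pods_output: str, job_name: str) -> list[tuple[str, str]]:
    """Return (pod name, status) pairs for pods whose name starts with "<job_name>-".

    Hypothetical helper; assumes the default "kubectl get pods" columns.
    """
    pods = []
    for line in get_pods_output.splitlines():
        fields = line.split()
        # Column 0 is NAME, column 2 is STATUS; the header line is skipped
        # because "NAME" does not start with the job prefix.
        if len(fields) >= 3 and fields[0].startswith(job_name + "-"):
            pods.append((fields[0], fields[2]))
    return pods


# Sample text copied from the output above, not from a live cluster.
sample = """NAME            READY   STATUS    RESTARTS   AGE
tf-mnist-xxxx   1/1     Running   0          7s
"""
print(find_job_pods(sample, "tf-mnist"))  # prints [('tf-mnist-xxxx', 'Running')]
```

Pods created by a job normally also carry a job-name label, so kubectl get pods -l job-name=tf-mnist should narrow the listing down on the command line as well.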
Now that you know the name of the pod, you can check in on the logs:
# replace xxxx with the code from get pods.
> kubectl logs tf-mnist-xxxx
# this should show the console output of your python program
or get some more information about the job, the node the pod was placed on etc.
> kubectl describe job tf-mnist
# replace xxxx with the code from get pods.
> kubectl describe pod tf-mnist-xxxx
You can also open a shell in the running container, just as with docker:
> kubectl exec -it tf-mnist-xxxx -- /bin/bash
root@tf-mnist-xxxxx:/workspace# nvidia-smi
Tue Jun 18 14:25:00 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:E7:00.0 Off |                    0 |
| N/A   39C    P0    68W / 350W |  30924MiB / 32480MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@tf-mnist-xxxxx:/workspace# ls /application/
nn.py run.sh tf-mnist.py
root@tf-mnist-xxxxx:/workspace#
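The job script notes that nvidia-smi inside the container should show exactly the number of GPUs requested under "limits". A Python program running in the container could verify this itself before starting work. The sketch below assumes that nvidia-smi -L prints one line per device beginning with "GPU "; the function names count_gpus and visible_gpu_count are hypothetical, and the sample line is a made-up illustration of the listing format, not output from the cluster.

```python
import subprocess


def count_gpus(listing: str) -> int:
    """Count device lines in the text printed by "nvidia-smi -L"."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def visible_gpu_count() -> int:
    """Run "nvidia-smi -L" and count the listed GPUs.

    Only works where nvidia-smi is installed, for example inside the container.
    """
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    )
    return count_gpus(result.stdout)


# Illustrative listing with a placeholder UUID, not real cluster output.
sample = "GPU 0: Tesla V100-SXM3-32GB (UUID: GPU-00000000-0000-0000-0000-000000000000)\n"
print(count_gpus(sample))  # prints 1
```

Inside the container, comparing visible_gpu_count() against the value of nvidia.com/gpu in the job script (1 in this tutorial) gives an early warning if more GPUs are exposed than were requested.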
Shutting down the job early
If, while inspecting the job, you notice that it does not run correctly, you can shut it down early with
> kubectl delete -f job_script.yaml
Note that this also deletes all data your container might have written to its filesystem layer. If you want to save your trained models, you have to mount persistent volumes from the Kubernetes cluster into the container. This is covered in the persistent volume tutorial.