Tutorials:Run the example container on the cluster
Revision as of 14:31, 18 June 2019
Requirements
- A working connection and login to the Kubernetes cluster.
- A valid namespace selected with authorization to run pods.
- A test container pushed to the CCU docker registry.
Set up a Kubernetes job script
Download the Kubernetes samples and look at the job script in example_1. Alternatively, create your own directory with a file named "job_script.yaml". Edit its contents as follows, replacing all placeholders with your data:
<syntaxhighlight lang="yaml">
apiVersion: batch/v1
kind: Job
metadata:
  # name of the job
  name: tf-mnist

spec:
  template:
    spec:
      # list of containers belonging to the job starts here
      containers:
        # container name used for pod creation
        - name: tf-mnist-container
          # container image from the registry
          image: ccu.uni-konstanz.de:5000/bastian.goldluecke/tf_mnist:0.1

          # container resources requested from the node
          resources:
            # limits are hard upper bounds on resource usage
            limits:
              # this gives us at most 2 GiB of main memory. Note that this is
              # a hard limit; exceeding it means the container is killed
              # immediately with an error.
              memory: "2Gi"

              # this requests a number of GPUs. GPUs are allocated to the
              # container exclusively; fractional GPUs cannot be requested.
              # When executing nvidia-smi in the container, it should show
              # exactly this number of GPUs.
              #
              # PLEASE DO NOT SET THE NUMBER TO ZERO, EVER, AND ALWAYS INCLUDE THIS LINE.
              #
              # It is a known limitation of NVIDIA's runtime that if zero GPUs
              # are requested, then actually *all* GPUs are exposed in the
              # container. We are looking for a fix to this.
              #
              nvidia.com/gpu: "1"
            # requests are the minimum guaranteed to the container; the
            # scheduler places the pod on a node that can satisfy them
            requests:
              memory: "2Gi"
          command: ["/application/run.sh"]

      # login credentials for the docker registry.
      # for convenience, a read-only credential is provided as a secret in each namespace.
      imagePullSecrets:
        - name: registry-ro-login

      # containers will never restart
      restartPolicy: Never

  # number of retries after failure.
  # since we typically have to fix something in that case, it is set to zero by default.
  backoffLimit: 0
</syntaxhighlight>
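The resources section generalizes in the obvious way. As a sketch (the values here are illustrative, not recommendations for this cluster), a job that needs two GPUs, more memory, and a CPU cap could declare:

<syntaxhighlight lang="yaml">
# Sketch only: illustrative values, not cluster recommendations.
resources:
  limits:
    # hard caps: the container is killed if it exceeds the memory limit
    # and throttled if it exceeds the CPU limit
    memory: "8Gi"
    cpu: "4"
    # two exclusive GPUs; nvidia-smi inside the container should show both
    nvidia.com/gpu: "2"
  requests:
    # minimum guaranteed resources, used by the scheduler for placement;
    # a request must not exceed the corresponding limit
    memory: "8Gi"
    cpu: "2"
</syntaxhighlight>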
When we start this job, Kubernetes creates a single container from the image we previously uploaded to the registry and runs it on a suitable node serving the selected namespace of the cluster.
<syntaxhighlight lang="bash">
> kubectl apply -f job_script.yaml
</syntaxhighlight>
Checking in on the container
We first check if our container is running.
<syntaxhighlight lang="bash">
> kubectl get pods
# somewhere in the output you should see a line like this:
NAME            READY   STATUS    RESTARTS   AGE
tf-mnist-xxxx   1/1     Running   0          7s
</syntaxhighlight>
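If you want to script this check, the pod name can be extracted from the `kubectl get pods` output with standard text tools. A minimal sketch operating on sample output in the format shown above (the suffix `abcd` is a hypothetical stand-in for the generated pod suffix):

```shell
# Sample output standing in for `kubectl get pods` (the suffix "abcd"
# is hypothetical; yours will differ).
sample_output='NAME            READY   STATUS    RESTARTS   AGE
tf-mnist-abcd   1/1     Running   0          7s'

# Select the first pod whose name starts with the job name.
pod=$(printf '%s\n' "$sample_output" | awk '$1 ~ /^tf-mnist-/ {print $1; exit}')
echo "$pod"   # tf-mnist-abcd

# Against a live cluster, the same pattern would be (not run here):
#   pod=$(kubectl get pods --no-headers | awk '$1 ~ /^tf-mnist-/ {print $1; exit}')
```
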
Now that you know the name of the pod, you can check in on the logs:
<syntaxhighlight lang="bash">
# replace xxxx with the code from get pods.
> kubectl logs tf-mnist-xxxx
# this should show the console output of your python program
</syntaxhighlight>
Alternatively, get some more information about the job, the node the pod was placed on, etc.:
<syntaxhighlight lang="bash">
> kubectl describe job tf-mnist
# replace xxxx with the code from get pods.
> kubectl describe pod tf-mnist-xxxx
</syntaxhighlight>
You can also open a shell in the running container, just as with Docker:
<syntaxhighlight lang="bash">
> kubectl exec -it tf-mnist-xxxx /bin/bash
root@tf-mnist-xxxx:/workspace# nvidia-smi
Tue Jun 18 14:25:00 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:E7:00.0 Off |                    0 |
| N/A   39C    P0    68W / 350W |  30924MiB / 32480MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@tf-mnist-xxxx:/workspace# ls /application/
nn.py  run.sh  tf-mnist.py
root@tf-mnist-xxxx:/workspace#
</syntaxhighlight>