Tutorials:Run the example container on the cluster

From Collective Computational Unit
Revision as of 14:31, 18 June 2019 by Bastian.goldluecke (talk | contribs) (Set up a Kubernetes job script)

Requirements

  • A working connection and login to the Kubernetes cluster.
  • A valid namespace selected with authorization to run pods.
  • A test container pushed to the CCU docker registry.


Set up a Kubernetes job script

Download the Kubernetes samples and look at the job script in example_1. Alternatively, create your own directory with a file named "job_script.yaml". Edit the contents as follows, replacing all placeholders with your own data:

apiVersion: batch/v1
kind: Job
metadata:
  # name of the job
  name: tf-mnist

spec:
  template:
    spec:
      # List of containers belonging to the job starts here
      containers:
      # container name used for pod creation
      - name: tf-mnist-container
        # container image from the registry
        image: ccu.uni-konstanz.de:5000/bastian.goldluecke/tf_mnist:0.1

        # container resources requested from the node
        resources:
          # limits are upper bounds on what the container may use
          limits:
            # this gives us 2 GiB of main memory. Note that this is a hard limit,
            # exceeding it will mean the container exits immediately with an error.
            memory: "2Gi"

            # this requests a number of GPUs. GPUs will be allocated to the container
            # exclusively. No fractional GPUs can be requested.
            # When executing nvidia-smi in the container, it should show exactly this
            # number of GPUs.
            #
            # PLEASE DO NOT SET THE NUMBER TO ZERO, EVER, AND ALWAYS INCLUDE THIS LINE.
            #
            # It is a known limitation of NVIDIA's runtime that if zero GPUs are requested,
            # then actually *all* GPUs are exposed in the container.
            # We are looking for a fix to this.
            #
            nvidia.com/gpu: "1"
          # requests are the minimum the scheduler reserves for the container on the node
          requests:
            memory: "2Gi"
        command: ["/application/run.sh"]


      # login credentials to the docker registry.
      # for convenience, a readonly credential is provided as a secret in each namespace.
      imagePullSecrets:
      - name: registry-ro-login

      # containers will never restart
      restartPolicy: Never

  # number of retries after failure.
  # since we typically have to fix something in this case, set to zero by default.
  backoffLimit: 0
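
Because the GPU line must always be present and must never be zero, it may be worth checking the manifest before submitting it. The following is a minimal sketch of such a check, not part of the CCU samples; the function name check_gpu_limit is hypothetical, and it assumes the limit is written on a single line as in the script above.

```shell
# Hypothetical helper: succeeds only if the manifest sets a non-zero
# nvidia.com/gpu value. Commented-out lines are ignored.
check_gpu_limit() {
    manifest="$1"
    # Extract the number from the first uncommented "nvidia.com/gpu:" line,
    # with or without surrounding double quotes.
    count=$(sed -n 's|^[[:space:]]*nvidia\.com/gpu:[[:space:]]*"\{0,1\}\([0-9][0-9]*\)"\{0,1\}[[:space:]]*$|\1|p' "$manifest" | head -n 1)
    if [ -z "$count" ]; then
        echo "no nvidia.com/gpu entry found in $manifest" >&2
        return 1
    fi
    if [ "$count" -eq 0 ]; then
        echo "nvidia.com/gpu must not be zero" >&2
        return 1
    fi
    # Print the requested number of GPUs on success.
    echo "$count"
}

# Usage (assumes job_script.yaml is in the current directory):
# check_gpu_limit job_script.yaml
```

The check is purely textual, so it will not catch a GPU entry placed at the wrong nesting level.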

Starting this job creates a single container from the image we previously uploaded to the registry. The container is placed on a suitable node that serves the selected namespace of the cluster.

> kubectl apply -f job_script.yaml
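
For scripted use, it can be convenient to block until the job has finished and then print its output. The sketch below wraps kubectl wait and kubectl logs for that purpose; the function name wait_and_show_logs and the default timeout of 600s are assumptions chosen for illustration.

```shell
# Hypothetical helper: wait for a job to complete, then print its logs.
wait_and_show_logs() {
    job="$1"
    timeout="${2:-600s}"
    # kubectl wait returns non-zero if the condition is not met in time.
    # This also happens when the job fails instead of completing.
    if kubectl wait --for=condition=complete --timeout="$timeout" "job/$job"; then
        kubectl logs "job/$job"
    else
        echo "job $job did not complete within $timeout" >&2
        return 1
    fi
}

# Usage:
# wait_and_show_logs tf-mnist 30m
```

Since the job script sets backoffLimit to zero, a failed job will never reach the complete condition, so the helper simply reports a timeout in that case.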

Checking in on the container

We first check if our container is running.

> kubectl get pods
# somewhere in the output you should see a line like this:
NAME             READY   STATUS    RESTARTS   AGE
tf-mnist-xxxx   1/1     Running   0          7s
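
Instead of reading the pod name off the table, it can be looked up by label. This is a minimal sketch, assuming the pods carry the job-name label that the Kubernetes job controller normally attaches; the function name pods_of_job is hypothetical.

```shell
# Hypothetical helper: print the names of all pods created by a job,
# one per line. Assumes the pods carry the label job-name=<job name>.
pods_of_job() {
    kubectl get pods --selector="job-name=$1" \
        --output=jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
}

# Usage:
# pods_of_job tf-mnist
```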

Now that you know the name of the pod, you can check in on the logs:

# replace xxxx with the code from get pods.
> kubectl logs tf-mnist-xxxx
# this should show the console output of your Python program

You can also get more information about the job, the node the pod was placed on, etc.:

> kubectl describe job tf-mnist
# replace xxxx with the code from get pods.
> kubectl describe pod tf-mnist-xxxx
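
If only the node name is of interest, the describe output can be bypassed with a JSONPath query. The following sketch reads the nodeName field from the pod specification; the function name pod_node is hypothetical.

```shell
# Hypothetical helper: print the node a pod was scheduled on.
# The output is empty while the pod has not been scheduled yet.
pod_node() {
    kubectl get pod "$1" --output=jsonpath='{.spec.nodeName}'
}

# Usage (replace xxxx with the code from get pods):
# pod_node tf-mnist-xxxx
```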


You can also open a shell in the running container, just as with Docker:

> kubectl exec -it tf-mnist-xxxx -- /bin/bash
root@tf-mnist-xxxxx:/workspace# nvidia-smi
Tue Jun 18 14:25:00 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:E7:00.0 Off |                    0 |
| N/A   39C    P0    68W / 350W |  30924MiB / 32480MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@tf-mnist-xxxxx:/workspace# ls /application/
nn.py  run.sh  tf-mnist.py
root@tf-mnist-xxxxx:/workspace#
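
The number of GPUs visible in the container should match the number requested in the job script. As a rough sketch, this can be checked without an interactive shell by counting the lines printed by nvidia-smi -L, which lists one GPU per line; the function name visible_gpus is hypothetical, and the sketch assumes nvidia-smi is available in the container image.

```shell
# Hypothetical helper: count the GPUs visible inside a running pod.
visible_gpus() {
    # nvidia-smi -L prints one line per GPU, each starting with "GPU <index>:".
    kubectl exec "$1" -- nvidia-smi -L | grep -c '^GPU '
}

# Usage (replace xxxx with the code from get pods):
# visible_gpus tf-mnist-xxxx
```

A result larger than the value of nvidia.com/gpu in the job script would suggest that the container sees GPUs it was not allocated.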