Tutorials:Run the example container on the cluster


Latest revision as of 08:55, 18 November 2020


Requirements

  • A working connection and login to the Kubernetes cluster.
  • A valid namespace selected with authorization to run pods.
  • A test container pushed to the CCU docker registry.
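The last requirement, a test container in the registry, comes from the earlier container tutorial. As a reminder, pushing an image typically looks like the following sketch; the registry host and the image name tf_mnist:0.1 are taken from the job script below, and your.username is a placeholder for your own account:

```shell
# log in to the CCU registry with your cluster credentials
docker login ccu.uni-konstanz.de:5000

# tag the locally built image for the registry, then push it
docker tag tf_mnist:0.1 ccu.uni-konstanz.de:5000/your.username/tf_mnist:0.1
docker push ccu.uni-konstanz.de:5000/your.username/tf_mnist:0.1
```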


Set up a Kubernetes job script

Download the Kubernetes samples (Kubernetes_samples.zip) and look at the kubernetes subdirectory in example_1. Check out "make_configs.sh" and run it after you have set the bash environment variable "KUBERNETES_USER" to your cluster username:

> export KUBERNETES_USER=your.username
> ./make_configs.sh

This will create a number of yaml files (Kubernetes configuration files) from the templates in the "template" subdirectory. Check out the first example, "job-script.yaml":


apiVersion: batch/v1
kind: Job
metadata:
  # name of the job
  name: your-username-tf-mnist

spec:
  template:
    spec:
      # List of containers belonging to the job starts here
      containers:
      # container name used for pod creation
      - name: your-username-tf-mnist-container
        # container image from the registry
        image: ccu.uni-konstanz.de:5000/your.username/tf_mnist:0.1

        # container resources requested from the node
        resources:
          # requests are minimum resource requirements
          requests:
            # this gives us a minimum of 2 GiB of main memory to work with.
            memory: "2Gi"
            # you should allocate at least 1 CPU for machine learning jobs,
            # usually more if, for example, you have separate threads for reading data.
            # 1 CPU unit is 1 CPU core or hyperthread, depending on the CPU architecture.
            # Note that CPUs are typically not a scarce resource on our GPU servers,
            # so you can be a bit generous.
            cpu: 1

          # limits are maximum resource allocations
          limits:
            # this gives an absolute limit of 3 GiB of main memory.
            # exceeding it will mean the container exits immediately with an error.
            memory: "3Gi"

            # CPU limit, but pod will usually not be killed for excessive CPU use
            cpu: 1

            # this requests a number of GPUs. GPUs will be allocated to the container
            # exclusively. No fractional GPUs can be requested.
            # When executing nvidia-smi in the container, it should show exactly this
            # number of GPUs.
            #
            # PLEASE DO NOT SET THE NUMBER TO ZERO, EVER, AND ALWAYS INCLUDE THIS LINE.
            # ALWAYS PUT IT IN THE SECTION "limits", NOT "requests".
            #
            # It is a known limitation of NVIDIA's container runtime that if zero GPUs are requested,
            # then actually *all* GPUs are exposed in the container.
            # We are looking for a fix to this.
            #
            nvidia.com/gpu: "1"

        # the command which is executed after container creation
        command: ["/application/run.sh"]


      # login credentials to the docker registry.
      # for convenience, a readonly credential is provided as a secret in each namespace.
      imagePullSecrets:
      - name: registry-ro-login

      # containers will never restart
      restartPolicy: Never

  # number of retries after failure.
  # since we typically have to fix something in this case, set to zero by default.
  backoffLimit: 0
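Before submitting, it can help to check the manifest for syntax errors without creating anything on the cluster. A sketch using kubectl's client-side dry run (the --dry-run=client form requires a reasonably recent kubectl; older versions used a bare --dry-run flag):

```shell
# parse and validate the manifest locally; nothing is created on the cluster
kubectl apply --dry-run=client -f job-script.yaml
```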

When we start this job, Kubernetes will create a single container from the image we previously uploaded to the registry, on a suitable node that serves the selected namespace of the cluster.

> kubectl apply -f job-script.yaml

Checking in on the job

We first check if our container is running.

> kubectl get pods
# somewhere in the output you should see a line like this:
NAME             READY   STATUS    RESTARTS   AGE
your-username-tf-mnist-xxxx   1/1     Running   0          7s
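Since the xxxx suffix is generated, you can also select the job's pods by label instead of copying the name from the output: Kubernetes Jobs label their pods with the standard job-name label. A sketch:

```shell
# list only the pods created by this job
kubectl get pods --selector=job-name=your-username-tf-mnist

# capture the pod name in a variable for the commands below
POD=$(kubectl get pods --selector=job-name=your-username-tf-mnist \
      --output=jsonpath='{.items[0].metadata.name}')
echo "$POD"
```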

Now that you know the name of the pod, you can check in on the logs:

# replace xxxx with the code from get pods.
> kubectl logs your-username-tf-mnist-xxxx
# this should show the console output of your python program
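For a long-running training job you can also stream the output instead of polling, using standard kubectl log flags:

```shell
# follow the log stream (Ctrl-C stops following; the pod keeps running)
kubectl logs -f your-username-tf-mnist-xxxx

# or show only the most recent lines
kubectl logs --tail=20 your-username-tf-mnist-xxxx
```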

or get some more information about the job, the node the pod was placed on, and so on:

> kubectl describe job your-username-tf-mnist
# replace xxxx with the code from get pods.
> kubectl describe pod your-username-tf-mnist-xxxx


You can also open a shell in the running container, just as with docker:

> kubectl exec -it your-username-tf-mnist-xxxx -- /bin/bash
root@tf-mnist-xxxxx:/workspace# nvidia-smi
Tue Jun 18 14:25:00 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:E7:00.0 Off |                    0 |
| N/A   39C    P0    68W / 350W |  30924MiB / 32480MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@tf-mnist-xxxxx:/workspace# ls /application/
nn.py  run.sh  tf-mnist.py
root@tf-mnist-xxxxx:/workspace#

Shutting down the job early

If, while inspecting the job, you notice that it does not run correctly, you can shut it down prematurely with:

> kubectl delete -f job-script.yaml
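If you no longer have the yaml file at hand, deleting the job by name has the same effect; this is standard kubectl usage, and the job's pods are cleaned up along with it:

```shell
# delete the job (and its pods) by the name given in the manifest's metadata
kubectl delete job your-username-tf-mnist
```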

Note that this also deletes all data your container might have written to its filesystem layer. If you want to save your trained models, you have to mount persistent volumes from the Kubernetes cluster into the container. This is covered in the persistent volume tutorial.