Tutorials:container which trains MNIST using Tensorflow

From Collective Computational Unit

Overview

In this example, we study in depth how to create a machine learning container which can be run on the cluster. In principle, this works just like creating any other docker container - the main difficulty will be how to store data in a persistent manner. We will start simple, and then gradually add more capability for our program. To get started, download the tutorial code File:Container tf mnist example 1.zip.

Pre-requisites

You have prepared your system according to the previous tutorials, in particular,

  1. you have a working version of nvidia-docker installed on your system.
  2. you are logged into the nVidia GPU cloud docker registry.
  3. if you want to run the examples directly on your own system, without using a container, you also have to install Tensorflow for Python and a number of recommended packages. On Ubuntu:
# make sure you do this only if you have not installed
# tensorflow already from another source (e.g. self-compiled).
sudo apt install python-pip python-setuptools
sudo -H pip install scipy numpy tensorflow-gpu

Basic example without reading/writing persistent data

Check out the subdirectory "example_1" of the tutorial code. The structure is as follows:

-- example_1
   -- docker-compose.yml
   -- application
      -- Dockerfile
      -- src
         -- run.sh
         -- train_tf_mnist.py
         -- nn_tf_mnist.py

In the subdirectory "application/src" is the actual Python code of the project; the rest are directives for building and running the container. Let's first take a look at the application to get that out of the way. You should be able to run it directly on your system, without using containers:

cd example_1/application/src
python train_tf_mnist.py

Try it out: it should download the MNIST dataset (if it is not already on your system) and then display some output about the training process. We will not take a closer look at the source code; you will understand it if you are familiar with Tensorflow. Instead, we will focus on the Docker framework. The first important part is the docker-compose.yml.


docker-compose.yml

Together with the comments, it should be pretty much self-explanatory. In summary, this docker-compose file is going to build the application container, tag it with a specific name, and then run it once on our system, using a pre-configured entrypoint (i.e. a command which is executed after container creation). Please edit this file now and set your own username in the image tag.

#
# This defines the version of the docker-compose.yml
# file format we are using.
#
version: '2.3'

#
# In this section, all the services we are going to
# start are defined. Each service corresponds to one
# container.
#
services:

    # Our application container is the only one we start.
    application:

        # This tells docker-compose that we intend to
        # build the application container from scratch, it
        # is not just a pre-existing image. The build configuration
        # (kind of a makefile) resides in the subdirectory
        # "application" in the file "Dockerfile".
        build: 
            context: ./application
            dockerfile: Dockerfile

        # This gives the container which has been built a "tag".
        # A tag is a unique name which you can use to refer to this container.
        # It should be of the form "<registry>/<username>/<application>:<version>".
        # If <version> is not specified, it will get the default "latest".
        #
        # The registry should be the one of the CCU, same with your
        # username. You can also use a temporary image name here and
        # later use the "docker tag" command to rename it to the final name
        # you want to push to the registry.
        #
        image: ccu.uni-konstanz.de:5000/<your.username>/tf_mnist:0.1

        # The container needs the nvidia container runtime.
        # The following is equivalent to specifying "docker run --runtime=nvidia".
        # It is not necessary if nvidia is already configured as the
        # default runtime (as on the Kubernetes cluster).
        runtime: nvidia

        # Environment variables set when running the image,
        # which can, for example, be used to configure the
        # nVidia base container. You can use them to
        # configure your own code as well.
        #
        environment:
          - NVIDIA_VISIBLE_DEVICES=all

        # This container should only be started once: if it fails,
        # we have to fix errors anyway; if it exits successfully,
        # we are happy.
        restart: "no"

        # The entry point of the container, i.e. the script or executable
        # which is started after it has been created.
        entrypoint: "/application/run.sh"

application/Dockerfile

This is the equivalent of a makefile for our application container. It is hopefully also self-explanatory thanks to the comments.

# First, we define the base image of the container.
#
# Our example container is derived from nVidia's image
# 'tensorflow:18.06-py3'. The tag stands
# for container version 18.06 (nVidia's internal version number)
# and contains Tensorflow set up for Python 3.
#
FROM nvcr.io/nvidia/tensorflow:18.06-py3

#
# This is the maintainer of the container, referenced
# by e-mail address.
#
MAINTAINER ccu@uni-konstanz.de

#
# This is the first line which tells us how this container
# differs from the base image.
#
# In this case, we copy the subdirectory "src" from
# the directory containing the Dockerfile into the
# directory "/application" of the container image.
#
COPY src /application

#
# Many COPY commands can be issued, as well as RUN
# commands, which execute commands inside the container,
# e.g. to install packages.
#
# The following is just an example; it is not necessary for
# the application to run. The final container image will
# now contain the "nano" editor, just in case you need it
# when logging into the container (yes, you can do this while it's
# running). You should always squeeze as many package
# installations as possible into one RUN command, as each one
# generates a new intermediate container image.
#
# Note that COPY as well as RUN are executed with root
# privileges inside the container. Since a docker user can
# also mount arbitrary host directories into a container,
# being a docker user is basically equivalent to being a
# sudo user on the host system. When running on the Kubernetes
# cluster, container privileges are of course much more limited.
#
RUN apt-get update && apt-get install -y nano

#
# this is what will be executed by default when the container is run
#
ENTRYPOINT [ "/application/run.sh" ]

That is basically all that differs between deploying the application as a container and executing it directly on your system. Of course, there are more details to learn, in particular how to mount external filesystems of the cluster into the container so you can read and write persistent data; more on this later. Let's now run the container defined in the above configuration files.

Building and running the container

In the example_1 directory, do a

> docker-compose up --build

Note the --build flag: if it is missing, the container will not be rebuilt when the image already exists, even if the code has changed. You should see the base container image being downloaded (if it is not already on your system), and then the changes specified in the Dockerfile being applied. Afterwards, the container is run, which should produce similar output as running the application directly on your system. Note that I have baked an artificial delay into the code, so training is much slower than it should be. This gives us time to try out some of the other things we can do with a running container, which we will do now.

Stop the running container anytime using Ctrl-C.

Inspecting a running container

Let's start the container in the background using the -d flag. Note that you will not see console output anymore.

> docker-compose up --build -d

The container is now listed under running containers.

> docker ps
CONTAINER ID IMAGE                                                    COMMAND                CREATED        STATUS        PORTS     NAMES
bcc2a2e8f42e ccu.uni-konstanz.de:5000/bastian.goldluecke/tf_mnist:0.1 "/application/run.sh"  6 minutes ago  Up 6 minutes  6006/tcp  example_1_application_1

We can check the console output by reading the logs, using the ID or the name of the container.

> docker logs example_1_application_1

We can also open a shell inside the container and inspect its contents. In fact, you can execute any other command in it as well.

> docker exec -it example_1_application_1 /bin/bash
root@fcb20664b49e:/workspace# nvidia-smi
Wed Jun  5 10:06:31 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro GV100        On   | 00000000:03:00.0  On |                  Off |
| 38%   52C    P2    42W / 250W |  30981MiB / 32475MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

root@fcb20664b49e:/workspace# ls /application
nn_tf_mnist.py  run.sh  train_tf_mnist.py
root@fcb20664b49e:/workspace#

We can also copy the output of the container into the filesystem of the host using an scp-like syntax:

> docker cp example_1_application_1:/tmp/data ./tmp

Once you are done with trying things out, stop the container with

> docker kill example_1_application_1

Uploading the final container to the cluster registry

Once you are happy with your code, you can upload the container image to the cluster registry just as in the basic tutorial before. Replace the tag by whatever you have used in the docker-compose.yml.

docker push ccu.uni-konstanz.de:5000/<your.username>/tf_mnist:0.1

You can now execute this image from any computer which has the nvidia docker infrastructure installed and is logged into our cluster registry, for example like this:

docker run --runtime=nvidia -d ccu.uni-konstanz.de:5000/<your.username>/tf_mnist:0.1

This should give you an idea of how powerful this framework is. In particular, the image can now be executed from any compute node in the cluster, and it is ready to be deployed using Kubernetes.

Remark: persistent storage in the container

A container brings its own temporary file system, everything which is written to it will by default not impact the host. After the container terminates, all data which was stored on this temporary filesystem is lost. To write to persistent storage from inside the container which survives container destruction, you have to mount a host filesystem. This can be done in the docker-compose.yml like this (add to the "application" container section):
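A minimal sketch of such a bind mount, assuming the service layout from example_1 (only the new "volumes" key is shown; the host path "./output" is illustrative):

```yaml
services:

    application:

        # ... build, image, runtime, entrypoint as before ...

        # Mount the host directory "output" (relative to the
        # directory containing docker-compose.yml) to the
        # directory "/application/output" inside the container.
        volumes:
            - ./output:/application/output
```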

Everything which is written inside the container to "application/output" will now end up in the "output" directory on the host (paths are relative to the docker-compose config). Vice versa, whatever was or is put into the host directory is available inside the container. You can also mount read-only if you just want to import data from the host:
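A read-only mount is sketched below by appending the ":ro" flag to the volume entry; the directory names here are hypothetical:

```yaml
        volumes:
            # The container can read "/application/input", but
            # cannot modify the host directory "input".
            - ./input:/application/input:ro
```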

Note that this mechanism is only explained for reference and your own testing, it will not work on the cluster. See the detailed tutorial on Persistent data on the cluster for how it works here.

Remark: exposing container network ports on the host

One big use case of containers is web applications, i.e. the container acts as a webserver on some port. For example, the whole CCU web infrastructure lives inside several interconnected containers which expose ports on the main server. You can, for example, have a web server listening on port 80 inside the container and map this port to an arbitrary port on your host, connect containers to each other, and so on. If you are interested, check out the excellent docker tutorials available online, for example this one.
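As a sketch, mapping a web server listening on port 80 inside the container to port 8080 on the host would look like this in the service section of the docker-compose.yml (the host port is arbitrary):

```yaml
        ports:
            # "<host port>:<container port>"
            - "8080:80"
```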