Tutorials:container which trains MNIST using Tensorflow

From Collective Computational Unit

Revision as of 08:52, 5 June 2019

Overview

In this example, we study in depth how to create a machine learning container which can be run on the cluster. In principle, this works just like creating any other Docker container. However, from the very beginning, we should write our code so that it follows a few special conventions, in particular about where it reads and writes its data. While it is in principle possible to map the directories on the cluster node to any directory used by your program, it is advised that you stick to a certain structure, in particular if you intend your code to be easily understood by other people.

We will start simple, and then gradually add more capabilities to our program. To get started, download the tutorial code here (TODO: upload and add link).

Pre-requisites

You have prepared your system according to the previous tutorials, in particular,

  1. you have a working version of nvidia-docker installed on your system.
  2. you are logged into the nVidia GPU cloud docker registry.
  3. if you want to run the examples directly on your own system, without using a container, you also have to install Tensorflow for Python and a number of recommended packages. On Ubuntu:
# make sure you do this only if you have not installed
# tensorflow already from another source (e.g. self-compiled).
sudo apt install python-pip python-setuptools
sudo -H pip install scipy numpy tensorflow-gpu

Basic example without reading/writing persistent data

Check out the subdirectory "example_1" of the tutorial code. The structure is as follows:

-- example_1
   -- docker-compose.yml
   -- application
      -- Dockerfile
      -- src
         -- run.sh
         -- train_tf_mnist.py
         -- nn_tf_mnist.py
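
The file run.sh will later serve as the container's entrypoint. The actual file is part of the tutorial code; as a rough idea of its role, a minimal sketch might look like this (hypothetical, your copy may differ):

```shell
#!/bin/sh
# Hypothetical sketch of application/src/run.sh: the entrypoint
# simply changes into the source directory and launches training.
set -e
cd "$(dirname "$0")"
exec python train_tf_mnist.py
```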

In the subdirectory "application/src" is the actual Python code of the project; the rest are directives for how to build and run the container. Let's first take a look at the application to get that out of the way. You should be able to run it directly on your system, without using containers:

cd example_1/application/src
python train_tf_mnist.py

Try it out; it should download the MNIST dataset (if it is not already on your system) and then display some output about the training process. We will not go through the source code here; you will understand it if you are familiar with Tensorflow. Instead, we will focus on the Docker framework. The first important part is the docker-compose.yml.
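
As a rough illustration of the kind of computation the training script performs, here is a self-contained sketch of a softmax-regression training loop in plain numpy, using synthetic data in place of MNIST. This is purely illustrative; the real train_tf_mnist.py uses Tensorflow and the actual dataset.

```python
import numpy as np

# Illustrative sketch only: softmax regression on synthetic data
# standing in for MNIST (784 inputs = 28x28 pixels, 10 digit classes).
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 784)).astype(np.float32)  # fake images
y = rng.integers(0, 10, size=256)                       # fake digit labels

W = np.zeros((784, 10), dtype=np.float32)
b = np.zeros(10, dtype=np.float32)

for step in range(100):
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)             # softmax probabilities
    grad = p.copy()
    grad[np.arange(len(y)), y] -= 1.0             # d(cross-entropy)/d(logits)
    W -= 0.1 * (X.T @ grad) / len(y)              # gradient descent step
    b -= 0.1 * grad.mean(axis=0)

accuracy = float((np.argmax(X @ W + b, axis=1) == y).mean())
print(f"training accuracy: {accuracy:.2f}")
```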


docker-compose.yml

Together with the comments, it should be pretty much self-explanatory. In summary, this docker-compose is going to build a new container, tag it with a specific name, and then run it once on our system, using a pre-configured entrypoint (i.e. a command which will be executed after container creation).

#
# This defines the version of the docker-compose.yml
# file format we are using.
#
version: '2.3'

#
# In this section, all the services we are going to
# start are defined. Each service corresponds to one
# container.
#
services:

    # Our application container is the only one we start.
    application:

        # This tells docker-compose that we intend to
        # build the application image from scratch; it
        # is not just a pre-existing image. The build configuration
        # (kind of a makefile) resides in the subdirectory
        # "application" in the file "Dockerfile".
        build: 
            context: ./application
            dockerfile: Dockerfile

        # This gives the image which has been built a "tag".
        # A tag is a unique name which you can use to refer to this image.
        # It should be of the form "<registry>/<username>/<application>:<version>".
        # If <version> is not specified, it will default to "latest".
        #
        # The registry should be the one of the CCU, same with your
        # username. You can also use a temporary image name here and
        # later use the "docker tag" command to rename it to the final name
        # you want to push to the registry.
        #
        image: ccu.uni-konstanz.de:5000/<your.username>/tf_mnist:0.1

        # The container needs the nvidia container runtime.
        # The following is equivalent to specifying "docker run --runtime=nvidia".
        # It is not necessary if nvidia is already configured as the
        # default runtime (as on the Kubernetes cluster).
        runtime: nvidia

        # Environment variables set when running the image.
        # These can, for example, be used to configure the nVidia base
        # container, or to configure your own application code.
        #
        environment:
          - NVIDIA_VISIBLE_DEVICES=all

        # This container should only be started once: if it fails,
        # we have to fix errors anyway; if it exits successfully,
        # we are happy.
        restart: "no"

        # The entry point of the container, i.e. the script or executable
        # which is started after it has been created.
        entrypoint: "/application/run.sh"
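
With this file in place, a typical workflow would be the following, run from the directory containing docker-compose.yml (here: example_1/). The push step assumes you are logged into the CCU registry as described in the prerequisites:

```shell
docker-compose build   # build the image and apply the tag from the "image:" key
docker-compose up      # create the container and run its entrypoint once
docker-compose push    # optionally push the tagged image to the registry
```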

Logging console output of our program

Monitoring the training process with Tensorboard

Writing the trained/intermediate models to persistent storage