Tutorials:Container which trains MNIST using TensorFlow

From Collective Computational Unit
Revision as of 08:39, 19 May 2019 by Bastian.goldluecke (talk | contribs) (Basic example without reading/writing data)

Overview

In this example, we study in depth how to create a machine learning container that can be run on the cluster. In principle, this works just like creating any other Docker container. However, from the very beginning we should write our code so that it follows a few special conventions, in particular about where it reads and writes its data. While it is in principle possible to map the directories on the cluster node to any directory used by your program, it is advisable to stick to a certain structure, especially if you intend your code to be easily understood by other people.

We will start simple, and then gradually add more capabilities to our program:

Basic example without reading/writing persistent data

Check out the subdirectory "example_1" of the tutorial code. The structure is as follows:

-- example_1
   -- docker-compose.yml
   -- application
      -- Dockerfile
      -- src
         -- run.sh
         -- train_tf_mnist.py
         -- nn_tf_mnist.py
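
To give a rough idea of how these pieces fit together, a docker-compose.yml for this layout might look something like the following sketch. This is illustrative only, not the actual tutorial file; the service name is an assumption:

```yaml
# example_1/docker-compose.yml -- illustrative sketch, not the actual file
version: "3"
services:
  train-mnist:            # hypothetical service name
    build: ./application  # builds the image from application/Dockerfile
```

With a file like this, `docker-compose up` would build the image from the Dockerfile in "application" and start the training as a service.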

The subdirectory "application/src" contains the actual Python code of the project; the rest are directives for building and running the container. Let's first take a look at the application to get that out of the way. You should be able to run it directly on your system, without using containers:

cd example_1/application/src
python train_tf_mnist.py

Try it out: it should download the MNIST dataset to "/tmp/mnist" and then display some output about the training process. Let's take a look at the source. It is commented well enough that it should hopefully be self-explanatory:

TODO
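
Until the actual source appears here, the following is a minimal, hypothetical sketch of what a script like train_tf_mnist.py could look like, written against the tf.keras API. Note that a 2019-era TensorFlow 1.x tutorial would more likely have used the old input_data helper (which is what downloads to "/tmp/mnist"); tf.keras.datasets caches under "~/.keras" instead. All names and hyperparameters below are assumptions, not the tutorial's code:

```python
# Hypothetical sketch -- not the actual tutorial source.
# Assumes TensorFlow 2 with the Keras API.
import tensorflow as tf


def build_model():
    """A small fully connected classifier for 28x28 MNIST images."""
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])


def main():
    # Keras caches the download under ~/.keras/datasets by default,
    # not /tmp/mnist as the TF 1.x helper did.
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    model = build_model()
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # One epoch on a subset keeps this sketch fast; a real training
    # script would use the full data and more epochs.
    model.fit(x_train[:2048], y_train[:2048], epochs=1,
              validation_data=(x_test[:512], y_test[:512]))


if __name__ == "__main__":
    main()
```

In a setup like the tutorial's, the network definition would live in nn_tf_mnist.py and be imported from the training script, which keeps the model architecture reusable across training and evaluation code.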

Logging console output of our program

Monitoring the training process with Tensorboard

Writing the trained/intermediate models to persistent storage