Tutorials:container which trains MNIST using Tensorflow

Overview

In this example, we look in depth at how to create a machine learning container which can be run on the cluster. In principle, this works just like creating any other Docker container. However, from the very beginning, we should write our code so that it follows a few conventions, in particular about where the program reads and writes its data. While it is possible to map directories on the cluster node to any directory used by your program, it is advisable to stick to a fixed structure, in particular if you intend your code to be easily understood by other people.
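One way to keep the read/write locations flexible is to pass them to the program as command line options, so that whatever directories are mapped into the container can be handed over at start-up. The following is only a minimal sketch: the function name `parse_paths`, the option names `--data-dir` and `--output-dir`, and the default `/tmp/output` are assumptions made for illustration and are not part of the tutorial code.

```python
import argparse


def parse_paths(argv=None):
    """Parse the directories the program reads from and writes to.

    All option names and the output default are hypothetical; adapt them
    to the directory structure used on your cluster.
    """
    parser = argparse.ArgumentParser(description="Training script path options")
    # Where the input data set is stored or downloaded to.
    parser.add_argument("--data-dir", default="/tmp/mnist")
    # Where results (logs, models) should be written.
    parser.add_argument("--output-dir", default="/tmp/output")
    return parser.parse_args(argv)


# Example: pass the paths explicitly, as a container start script might do.
args = parse_paths(["--data-dir", "/data", "--output-dir", "/results"])
print("reading data from:", args.data_dir)
print("writing results to:", args.output_dir)
```

With this pattern the code itself never hard-codes a cluster path; only the start command needs to change when the directory mapping changes.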

We will start simple and then gradually add more capabilities to our program. To get started, download the tutorial code here (TODO: upload and add link).

Pre-requisites

You have prepared your system according to the previous tutorials. In particular:

You have a working version of nvidia-docker installed on your system.
You are logged into the NVIDIA GPU Cloud Docker registry.
If you want to run the examples directly on your own system, without using a container, you also have to install TensorFlow for Python and a number of recommended packages. On Ubuntu:

# Only do this if you have not already installed
# TensorFlow from another source (e.g. self-compiled).
sudo apt install python-pip
sudo pip install scipy numpy tensorflow
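
To confirm that the packages are visible to your Python interpreter, a quick check along the following lines can help. This is a minimal sketch; the helper name `missing_packages` is made up for this illustration. It only looks the modules up and does not import them, so it runs quickly even for TensorFlow.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means all three packages were found.
print(missing_packages(["scipy", "numpy", "tensorflow"]))
```

If the printed list is not empty, install the listed packages before running the example outside a container.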

Basic example without reading/writing persistent data

Check out the subdirectory "example_1" of the tutorial code. The structure is as follows:

-- example_1
   -- docker-compose.yml
   -- application
      -- Dockerfile
      -- src
         -- run.sh
         -- train_tf_mnist.py
         -- nn_tf_mnist.py

The subdirectory "application/src" contains the actual Python code of the project; the remaining files are instructions for building and running the container. Let's first take a look at the application to get that out of the way. You should be able to run it directly on your system, without using containers:

cd example_1/application/src
python train_tf_mnist.py

Try it out: it should download the MNIST dataset to "/tmp/mnist" and then display some output about the training process. Let's take a look at the source. It is commented thoroughly enough that it should be self-explanatory:

TODO

Logging console output of our program

Monitoring the training process with Tensorboard

Writing the trained/intermediate models to persistent storage
