Tutorials:container which trains MNIST using Tensorflow
Overview
In this example, we study in depth how to create a machine learning container that can be run on the cluster. In principle, this works just like creating any other Docker container. However, from the very beginning, the code should follow a few conventions, in particular about where it reads and writes its data. While it is possible to map directories on the cluster node to any directory your program uses, we advise sticking to a fixed structure, especially if you want other people to understand your code easily.
We will start simple and then gradually add more capabilities to our program. To get started, download the tutorial code here (TODO: upload and add link).
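The idea of keeping read and write locations in one place, so that they can later be mapped to directories on the cluster node, can be sketched as follows. This is only an illustration: the environment variable names DATA_DIR and OUTPUT_DIR and the default paths are hypothetical choices for this sketch, not the cluster's actual convention.

```python
import os
from pathlib import Path


def resolve_dirs(env=None):
    """Return (data_dir, output_dir) as Path objects.

    DATA_DIR and OUTPUT_DIR are hypothetical variable names chosen
    for this sketch; the defaults are placeholders as well.
    """
    env = os.environ if env is None else env
    data_dir = Path(env.get("DATA_DIR", "/tmp/mnist"))
    output_dir = Path(env.get("OUTPUT_DIR", "/tmp/output"))
    return data_dir, output_dir


if __name__ == "__main__":
    data_dir, output_dir = resolve_dirs()
    print("reading input from:", data_dir)
    print("writing results to:", output_dir)
```

Because the program only ever asks resolve_dirs() for its paths, the same code can run unchanged on a workstation and inside a container where these directories are mapped elsewhere.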
Pre-requisites
You have prepared your system according to the previous tutorials. In particular:
- you have a working version of nvidia-docker installed on your system.
- you are logged in to the NVIDIA GPU Cloud Docker registry.
- if you want to run the examples directly on your own system, without using a container, you also have to install TensorFlow for Python and a number of recommended packages. On Ubuntu:
# Only do this if you have not already installed
# TensorFlow from another source (e.g. self-compiled).
sudo apt install python-pip
sudo pip install scipy numpy tensorflow
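To confirm that the packages from the install step are visible to your Python interpreter, a small check like the following may help. It is a sketch that uses only the standard library; the helper name missing_packages is made up for this example.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The three packages installed by the pip command above.
    missing = missing_packages(["scipy", "numpy", "tensorflow"])
    if missing:
        print("missing packages:", ", ".join(missing))
    else:
        print("all required packages are importable")
```

The check only locates the modules and does not import them, so it does not catch a TensorFlow build that is installed but fails to load.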
Basic example without reading/writing persistent data
Check out the subdirectory "example_1" of the tutorial code. The structure is as follows:
-- example_1
   -- docker-compose.yml
   -- application
      -- Dockerfile
      -- src
         -- run.sh
         -- train_tf_mnist.py
         -- nn_tf_mnist.py
The subdirectory "application/src" contains the actual Python code of the project; the remaining files are instructions for building and running the container. Let's first take a look at the application to get that out of the way. You should be able to run it directly on your system, without using containers:
cd example_1/application/src
python train_tf_mnist.py
Try it out: it should download the MNIST dataset to "/tmp/mnist" and then display some output about the training process. Let's take a look at the source. It has enough comments that it is hopefully self-explanatory:
TODO