Changes

CCU:New GPU Cluster

294 bytes removed, 4 years ago

m

→‎Moving your workloads to the new cluster

Note that timeouts are in place: this demo pod runs for only 24 hours, and an interactive session also has a time limit. It is therefore better to build a custom run script that is executed when the container in the pod starts. See the Kubernetes documentation on pods and jobs for more details. TODO: link to the respective documentation.
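As a rough illustration of such a run script, here is a minimal sketch in Python. It assumes the script is set as the container's start command; the function name `run_workload` and the file names `run.py` and `train.py` mentioned below are hypothetical and not part of the cluster setup. It starts the workload without an interactive session, logs how long it ran, and hands back the exit code so the pod or job status can reflect success or failure.

```python
import subprocess
import sys
import time


def run_workload(command, log=sys.stderr):
    """Run the workload unattended and return its exit code.

    `command` is a list such as [sys.executable, "train.py"] (hypothetical).
    """
    started = time.monotonic()
    print(f"starting: {' '.join(command)}", file=log, flush=True)
    # No shell and no terminal attached: the process does not depend on
    # an interactive session staying open.
    result = subprocess.run(command, check=False)
    elapsed = time.monotonic() - started
    print(
        f"finished with exit code {result.returncode} after {elapsed:.0f} s",
        file=log,
        flush=True,
    )
    return result.returncode
```

Saved as, say, `run.py`, the script would end with a call like `sys.exit(run_workload([sys.executable, "train.py"]))`, so that the container exits with the same code as the workload.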

'''Important known issue: Currently, there seem to be problems with running Tensorflow applications on ~~at least one of~~ the ~~compute nodes (~~node zariel~~), the nVidia GPU device is not always created as "/dev/nvidia0", but sometimes as "/dev/nvidiaX", where X is another number (likely the device ID on the host)~~. ~~This makes Tensorflow fail to detect~~ Please use tiamat in the ~~GPU~~meantime. ~~While I am searching for an actual solution, a workaround is~~'''~~<syntaxhighlight>(shell inside the container)# ln -s /dev/nvidiaX /dev/nvidia0</syntaxhighlight>~~

If you do not have your code ready, you can quickly test whether GPU execution works by running demo code from [https://github.com/dragen1860/TensorFlow-2.x-Tutorials this tutorial] as follows:

Bastian.goldluecke (ccu, Administrators; 684 edits)
