CCU:Global dataset storage
Revision as of 15:38, 19 July 2020
Overview
The global dataset storage is intended for large, static datasets, in particular those which benefit multiple users (but feel free to also use it for your own data which only you need). Write access is very slow since it is tunneled over a slow filesystem for security and backup reasons (see below for technical details), so it will take a while until your datasets actually show up on the cluster. Read access, however, should be very fast (the NVMe RAID where the storage resides has a read speed of 1.9 GB/s and is accessed over a 10 Gbit/s network from nodes other than the DGX-2), and may in some cases even surpass local storage.
The global storage can be easily mounted in any container on any node as a read-only volume, while writing to it requires certain rsync commands on the master node; see below for detailed instructions. Every user has their own subdirectory within the global storage (readable by everyone, writable only by that user). In addition, there is a user-independent directory subtree with common machine learning datasets. If you believe a dataset in your own subdirectory is static and beneficial for many users, please contact me so it can be moved to the common tree.
Writing your data to the global storage
Accessing the global storage from within a container
Please see this page for an introduction on how to use the datasets.
List of datasets in global storage
Everyone, please update this list if you have any useful datasets to share. Feel free to create additional pages on the wiki if a dataset needs a longer description, or link to your project page in the respective column (see example below).
The KITTI Vision Dataset
- Description:
KITTI contains a suite of vision tasks built using an autonomous driving platform. The full benchmark covers many tasks such as stereo, optical flow, visual odometry, etc. This copy contains the object detection dataset, including the monocular images and bounding boxes: 7481 training images annotated with 3D bounding boxes. A full description of the annotations can be found in the readme of the object development kit on the KITTI homepage.
- Homepage: http://www.cvlibs.net/datasets/kitti/
- Benchmark: http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d
- TensorFlow Datasets entry: tfds.object_detection.Kitti
- Dataset size: 85 GiB
To find the dataset, just type
> kubectl get pvc
and you will get information like the following:
pvc name: dataset-kitti, capacity: 100Gi, access modes: RWO, storageclass: ceph-ssd
If you want to bind your project to this global dataset, please change your job-script-pvc.yaml as follows:
# list of mount paths within the container which will be
# bound to persistent volumes
volumeMounts:
  # path inside the container where the KITTI dataset will be mounted
  - mountPath: "/mnt/dataset_kitti"
    # name of the volume for this path (from the volumes list below)
    name: dataset-kitti-user

# login credentials for the docker registry.
# for convenience, a read-only credential is provided as a secret in each namespace.
imagePullSecrets:
  - name: registry-ro-login

# containers will never restart
restartPolicy: Never

volumes:
  # user-defined name of the persistent volume within this configuration.
  # this can be different from the name of the PVC.
  - name: dataset-kitti-user
    persistentVolumeClaim:
      # name of the PVC this volume binds to
      claimName: dataset-kitti
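For context, the fragments above could be placed in a complete minimal job specification like the following sketch. The job name, container name, image, and command are placeholders (not taken from this page); only the volume, mount path, PVC name, and pull secret come from the example above.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: kitti-example            # placeholder job name
spec:
  template:
    spec:
      containers:
        - name: trainer          # placeholder container name
          image: registry.example.com/user/train:latest   # placeholder image
          command: ["ls", "/mnt/dataset_kitti"]           # placeholder command
          volumeMounts:
            - mountPath: "/mnt/dataset_kitti"
              name: dataset-kitti-user
      imagePullSecrets:
        - name: registry-ro-login
      restartPolicy: Never
      volumes:
        - name: dataset-kitti-user
          persistentVolumeClaim:
            claimName: dataset-kitti
```

Submitted with kubectl apply -f, the container then sees the KITTI files under /mnt/dataset_kitti.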
TBC: nuScenes dataset