CCU:Global dataset storage
Latest revision as of 21:32, 23 September 2020
== Overview ==
The global dataset storage is intended for large, static datasets, in particular those which benefit multiple users (but feel free to also use it for your own data which only you need). Write access is very slow since it is tunneled over a slow filesystem for security and backup reasons (see below for technical details), so it will take a while until your datasets actually show up on the cluster. Read access, however, should be very fast, and might in some cases even surpass local storage.
The global storage can be easily mounted in any container on any node as a read-only volume, while you have to write to it using certain rsync commands on the master node. See below for detailed instructions. Every user has their own subdirectory within the global storage (readable by everyone, writeable only by that user). In addition, there is a user-independent directory subtree with common machine learning datasets. If you believe you have a dataset in your own subdirectory which is static and beneficial for many users, please contact me to move it to the common tree.
== Writing your data to the global storage ==
The global storage is populated from the subdirectory "datasets/cluster" in your home directory on the CCU master node Lolth (ccu-master.inf.uni-konstanz.de, IP 134.34.224.84). If you log in there with ssh, you will find the shell extremely limited for security reasons; however, you can still run rsync commands against the server. Assuming the current directory contains a subdirectory "my_dataset" which you want to copy to the global storage on the cluster, run
<syntaxhighlight lang="bash">
> rsync -avz --info=progress2 ./my_dataset your.username@ccu-master.inf.uni-konstanz.de:datasets/cluster/
</syntaxhighlight>
You now have a copy of your dataset on Lolth. Roughly every hour, the datasets on Lolth are synced to the directory "/raid/datasets/your.username" on an NFS server. This directory is exported, and you can mount it into any container running on the cluster. Note that every user has read access to the whole directory tree, so you can also use this method to share data between users. As a side effect, you now have two backups of your data on two different machines (however, both are in the same rack, so this is not really fire-proof).
You can also delete data from Lolth by ssh'ing into the machine and using rm inside the "datasets/cluster" subdirectory. During the hourly sync, data no longer present there is also deleted from the global cluster storage.
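The hourly sync's delete behavior can be pictured as an rsync mirror run with --delete (an assumption about the implementation, not a documented detail; the /tmp paths below are local stand-ins for Lolth and the NFS server):

```shell
# Local sketch of the hourly sync (assumed rsync --delete semantics):
# files removed from the source side disappear from the mirror on the
# next sync run.
mkdir -p /tmp/lolth/datasets/cluster /tmp/nfs/export
touch /tmp/lolth/datasets/cluster/keep.bin /tmp/lolth/datasets/cluster/old.bin
rsync -a --delete /tmp/lolth/datasets/cluster/ /tmp/nfs/export/   # first sync
rm /tmp/lolth/datasets/cluster/old.bin                            # delete on "Lolth"
rsync -a --delete /tmp/lolth/datasets/cluster/ /tmp/nfs/export/   # next hourly sync
ls /tmp/nfs/export                                                # old.bin is gone
```

The practical consequence is the same either way: anything you rm under "datasets/cluster" will vanish from the cluster-wide copy within about an hour.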
== Accessing the global storage from within a container ==
Please see this page for an introduction on how to use the datasets.
== List of datasets in global storage ==
Everyone, please update this list if you have any useful datasets to share. Feel free to generate additional pages on the Wiki in case a dataset needs more description, or link to your project page in the respective column (see example below).
'''The KITTI Vision Dataset'''
- Description:
Kitti contains a suite of vision tasks built using an autonomous driving platform. The full benchmark contains many tasks such as stereo, optical flow, visual odometry, etc. This dataset contains the object detection dataset, including the monocular images and bounding boxes. The dataset contains 7481 training images annotated with 3D bounding boxes. A full description of the annotations can be found in the readme of the object development kit on the Kitti homepage.
- Homepage: http://www.cvlibs.net/datasets/kitti/
- Benchmark: http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d
- Source code on TensorFlow Datasets: tfds.object_detection.Kitti
- Source code on PyTorch: https://github.com/poodarchu/Det3D
- Dataset size: 85 GiB
- The dataset has been fully downloaded and is organized as follows:
<syntaxhighlight lang="bash">
# KITTI Dataset
└── /raid/datasets/general/kitti
    ├── training             <-- 7481 train data
    |   ├── image_2          <-- for visualization
    |   ├── calib
    |   ├── label_2
    |   ├── velodyne
    |   └── velodyne_reduced <-- empty directory
    └── testing              <-- 7518 test data
        ├── image_2          <-- for visualization
        ├── calib
        ├── velodyne
        └── velodyne_reduced <-- empty directory
</syntaxhighlight>
'''nuScenes dataset''' (TBC)
<syntaxhighlight lang="bash">
# nuScenes Dataset
└── /raid/datasets/general/nuscenes
</syntaxhighlight>