CCU:Persistent storage on the Kubernetes cluster

From Collective Computational Unit
Latest revision as of 08:41, 4 July 2024


== The CephFS file system ==

As explained in the [[CCU:GPU Cluster Quick Start|quick start tutorial]], every user can mount certain local host paths inside their pods, which refer to a global distributed Ceph file system. As a reminder, the primary home directory is

<syntaxhighlight lang="bash">
/cephfs/abyss/home/<your-username>
</syntaxhighlight>

This file system is usually quite fast, but only for the workloads it is designed for. It is a distributed storage, where the filesystem metadata is stored in databases on dedicated servers and the actual file contents on others. This means that metadata access (such as reading file attributes, or looking up which server holds a specific file) can be a bottleneck: reading the metadata of a small file is orders of magnitude more expensive than reading the file's actual contents. As a result, performance breaks down dramatically when writing or accessing many small files. In particular, having many small files in a single directory (say >10k) makes even simple filesystem operations such as directory listings take a very long time, and automated backup jobs might run into problems.

'''TL;DR, and this is very important: when using CephFS, make sure to organize your dataset in a few large files (e.g. HDF5), and not many small ones! If you really have to keep individual files, make sure they are stored in subdirectories which do not become too large.'''

For example, if you have a million images of the form abcdef.jpg in a single directory, you should distribute them over a directory tree a/b/c/def.jpg, so that each directory holds only about 1000 files.
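As a sketch, this sharding can be done with plain shell tools. The directory and file names below are throwaway placeholders for demonstration, not a cluster convention:

```shell
#!/bin/sh
# demo directory with a couple of dummy "images" (placeholder names)
mkdir -p /tmp/flat_demo
cd /tmp/flat_demo || exit 1
touch abcdef.jpg aghijk.jpg

# move every file xyzrest.jpg into x/y/z/rest.jpg
for f in *.jpg; do
    d="$(echo "$f" | cut -c1)/$(echo "$f" | cut -c2)/$(echo "$f" | cut -c3)"
    mkdir -p "$d"
    mv "$f" "$d/$(echo "$f" | cut -c4-)"
done
# abcdef.jpg now lives at a/b/c/def.jpg
```

The same loop works on CephFS itself, but it is much faster to shard the data before uploading it.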

An interesting option if you have a dataset consisting of many small files might be to keep it in a tar archive and mount that archive using [https://github.com/mxmlnkn/ratarmount ratarmount].
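A minimal sketch of that workflow, assuming a FUSE-capable environment; the archive name and mount point are placeholders:

```shell
# install ratarmount (a FUSE-based tool)
python3 -m pip install ratarmount

# mount the archive read-only; the first mount builds an index,
# subsequent mounts of the same archive are fast
ratarmount my-dataset.tar /mnt/my-dataset

ls /mnt/my-dataset        # browse the files as if they were extracted

# unmount when done
fusermount -u /mnt/my-dataset
```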

If this is not possible for you, then you need to use the local SSD storage on a single node, which for small files is orders of magnitude faster, but you are bound to a particular node (or have to duplicate the data in different local filesystems). See below for details on local filesystems.

== CephFS capacity and backup strategy ==

The storage on the Ceph filesystem is quite expensive due to the built-in redundancy (if any server reboots or is otherwise unavailable, the others can still serve all of the data). The contents of the home directories are also backed up daily onto a backup server with a file history - if you ever accidentally overwrite or otherwise lose an extremely important file, you can contact me to check whether an old copy exists in a backup.

Currently, there is sufficient space left; however, I kindly ask you not to keep data you no longer use on the Ceph filesystem for too long. In particular, please delete old checkpoints of training runs you will never need again - I have seen people use several terabytes for their training histories. If you still need these, please move them onto your own computers. If you really want to keep old material lying around on the cluster filesystem, perhaps because you are not sure whether you will need it again later, then please put it into a folder which is not backed up. For this, every user can mount the Ceph directory

<syntaxhighlight lang="bash">
/cephfs/abyss/archive/nobackup/<your-username>
</syntaxhighlight>

which can be used as an archive. Make sure that the directory is created if it does not exist, by specifying "type: DirectoryOrCreate".
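For illustration, a hostPath volume entry in a pod spec mounting this archive directory might look as follows; the volume name and mount path are arbitrary choices, not a cluster convention:

```yaml
# fragment of a pod spec; see the full example further below
spec:
  containers:
  - name: ubuntu
    # ...
    volumeMounts:
      - mountPath: /archive
        name: cephfs-nobackup
  volumes:
    - name: cephfs-nobackup
      hostPath:
        path: "/cephfs/abyss/archive/nobackup/<your-username>"
        # create the directory on first use if it does not exist
        type: DirectoryOrCreate
```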

== Local storage on the node ==

The path for local storage for each user is

<syntaxhighlight lang="bash">
/raid/local-data/<your-username>
</syntaxhighlight>

You can mount it as a hostPath, but have to make sure that the directory is created if it does not exist, by specifying "type: DirectoryOrCreate".

The data will remain persistent on the host, but note that it also only exists on this particular host. If you need to access it again, you need to make sure the pod always ends up on the same specific node. See example below. Otherwise, write your scripts in such a way that they check for existence of the local data, and if it is not there yet, copy it over from somewhere on the internet.
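For example, a job's startup script could guard against missing local data like this; the directory, URL, and archive name are placeholders for your own setup:

```shell
#!/bin/sh
# location of the dataset on the node-local mount (placeholder)
DATA_DIR=/local/my-dataset

if [ ! -d "$DATA_DIR" ]; then
    # first run on this node: fetch and unpack the dataset
    mkdir -p "$DATA_DIR"
    wget -qO- https://example.org/my-dataset.tar.gz | tar -xz -C "$DATA_DIR"
fi

# ... start training using $DATA_DIR ...
```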

'''In contrast to Ceph storage, local paths on the hosts are not backed up. You have been warned.'''

== Example ==

The following example creates an access pod on the compute node "tiamat" which mounts the local storage as well as all your personal directories in the Ceph file system:

<syntaxhighlight lang="yaml">
apiVersion: v1
kind: Pod
metadata:
  name: storage-access-pod-tiamat
spec:
  nodeSelector:
    kubernetes.io/hostname: tiamat

  containers:
  - name: ubuntu
    image: ubuntu:20.04
    command: ["sleep", "1d"]
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 1
        memory: 1Gi
    volumeMounts:
      - mountPath: /abyss/home
        name: cephfs-home
        readOnly: false
      - mountPath: /abyss/shared
        name: cephfs-shared
        readOnly: false
      - mountPath: /abyss/datasets
        name: cephfs-datasets
        readOnly: true
      - mountPath: /local
        name: local-storage
        readOnly: false
  volumes:
    - name: cephfs-home
      hostPath:
        path: "/cephfs/abyss/home/<your-username>"
        type: Directory
    - name: cephfs-shared
      hostPath:
        path: "/cephfs/abyss/shared"
        type: Directory
    - name: cephfs-datasets
      hostPath:
        path: "/cephfs/abyss/datasets"
        type: Directory
    - name: local-storage
      hostPath:
        path: "/raid/local-data/<your-username>"
        type: DirectoryOrCreate
</syntaxhighlight>
== Reading/writing to the directories in the pod ==

After you have created the access pod with "kubectl apply -f <filename>.yaml", you have several options to get data to and from the container.
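For instance, assuming the pod definition above was saved as storage-access-pod.yaml (the filename is your choice):

```shell
# create the pod and wait until it is running
kubectl apply -f storage-access-pod.yaml
kubectl wait --for=condition=Ready pod/storage-access-pod-tiamat

# verify it was scheduled on the requested node (NODE column)
kubectl get pod storage-access-pod-tiamat -o wide
```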

=== Copying data from within the container ===

You can get a root shell inside the container as usual (insert the correct pod name you used below):

<syntaxhighlight lang="bash">
> kubectl exec -it storage-access-pod-tiamat -- /bin/bash
</syntaxhighlight>

Your pod has internet access. Thus, an option to get data to/from the pod, in particular into the persistent volume, is to use scp, which first might need to be installed inside the pod:

<syntaxhighlight lang="bash">
# apt-get update && apt-get install -y openssh-client rsync
# cd /local
# scp your.username@external-server:/path/to/data/. ./
</syntaxhighlight>

An even better variant would be "rsync -av" instead of scp, as this only copies files which are different or do not exist in the destination. By reversing source and destination, you can also copy data out of the container this way.
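A sketch of both directions, run from a shell inside the pod; the server name and paths are placeholders:

```shell
# copy data into the pod; only transfers new or changed files
rsync -av your.username@external-server:/path/to/data/ /local/my-dataset/

# reverse source and destination to copy results out again
rsync -av /local/results/ your.username@external-server:/path/to/results-backup/
```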

=== Copying data from your local machine ===

From a local machine which has kubectl access to the cluster, you can directly copy data to and from the container using "kubectl cp", which has a syntax very similar to scp:

<syntaxhighlight lang="bash">
# to get data into the container (substitute the pod name you used)
> kubectl cp /path/to/data/. storage-access-pod-tiamat:/local/data
# to get data from the container
> kubectl cp storage-access-pod-tiamat:/local/data/. /path/to/output/
</syntaxhighlight>

Read the Kubernetes "kubectl cp" documentation to check how it handles directories; the semantics are a bit unusual and slightly different from scp.

Note: kubectl cp internally uses tar and some compression to speed up the network transfer. However, this means that your access pod needs a certain amount of memory, in particular when transferring large files. If you run into "error 137" (out of memory), increase the memory limits of the access pod or use scp from within the pod.