Cluster:Changelog
10.02.2022
- Zariel has been repaired (twice) and should now be operational again. Please test it. Taints are active for the node, so make sure to add tolerations, as shown here.
20.01.2022
- Two new nodes added (Imp and Dretch). They are somewhat outdated, but should still be OK for testing and less compute-intensive projects.
- One new, very powerful node added (Asmodeus, 4x A100 @ 80 GB each). It is currently configured with 8 virtual GPUs at 40 GB each, but if you ever really need the full 80 GB, you can contact me.
- Zariel removed since it currently has a hardware failure. Working with NVIDIA support, no ETA at the moment.
- Some of the more powerful nodes (Asmodeus and Vecna) have now been "tainted" so that the pod scheduler cannot use them by default. A pod has to explicitly "tolerate" the taint in its configuration before it can be scheduled on these nodes. Please refer to the list of compute nodes for more explanations and examples.
- Taints will also be added to other nodes, so that by default, your pods will only be scheduled on the least powerful nodes in the cluster. Please start updating your pod configurations if you have preferred nodes.
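As a rough illustration of what "tolerating" a taint looks like in a pod configuration, the following Python sketch builds a minimal pod manifest with a `tolerations` entry and prints it as JSON. The taint key and value (`example.org/gpu-tier`, `high`), the `NoSchedule` effect, the image name, and the `nvidia.com/gpu` resource name are assumptions made for illustration only; the actual values for each node are documented in the list of compute nodes.

```python
import json


def pod_with_toleration(name, image, taint_key, taint_value, gpus=1):
    """Build a pod manifest that tolerates a single NoSchedule taint.

    All arguments are placeholders; look up the real taint key and value
    of the node you want to use before applying the manifest.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Assumed GPU resource name (NVIDIA device plugin default).
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
            # Without this entry the scheduler skips nodes carrying the taint.
            "tolerations": [
                {
                    "key": taint_key,
                    "operator": "Equal",
                    "value": taint_value,
                    "effect": "NoSchedule",  # assumed taint effect
                }
            ],
            "restartPolicy": "Never",
        },
    }


# Hypothetical values, not the cluster's real taint or image.
manifest = pod_with_toleration(
    name="toleration-demo",
    image="registry.example.org/my-image:latest",
    taint_key="example.org/gpu-tier",
    taint_value="high",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and passed to `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML. Note that a toleration only permits scheduling on a tainted node; it does not force the pod onto that node.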
28.12.2021
- Kubernetes version has been updated to 1.23.1. Please update your kubectl accordingly.
- Pod security infrastructure has been migrated from the deprecated PodSecurityPolicy to OPA/Gatekeeper. No changes on your side should be required if everything was configured as intended, but please inform me if there are things you should be allowed to do but cannot, or things you can do that should be forbidden.
- All GPU drivers have been updated to the most recent versions available for the respective machines. You might have to migrate to more recent versions of GPU containers. The GPU driver and CUDA versions of all compute nodes are now shown on the cluster status page.
- Node Zariel is currently not available. The system update broke something and the node did not boot up. I need physical access to the server room, so the earliest date to fix it is January 10th. Please be considerate with the number of GPUs you reserve.
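For the kubectl update mentioned above: Kubernetes supports a kubectl client that is within one minor version of the API server, so with the cluster at 1.23.1, clients from 1.22 to 1.24 are within the supported range. The following sketch checks a client version string against that rule. The function names are hypothetical helpers, and the version strings are expected in the form reported by `kubectl version` (for example `v1.22.4`).

```python
import re


def parse_version(text):
    """Extract (major, minor) from a version string such as 'v1.23.1'."""
    match = re.match(r"v?(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def kubectl_is_compatible(client, server="1.23.1"):
    """Return True if the client is within one minor version of the server."""
    client_major, client_minor = parse_version(client)
    server_major, server_minor = parse_version(server)
    return client_major == server_major and abs(client_minor - server_minor) <= 1


for candidate in ("v1.20.0", "v1.22.4", "v1.23.1", "v1.24.0"):
    status = "ok" if kubectl_is_compatible(candidate) else "update needed"
    print(candidate, status)
```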
01.02.2021
- Full cluster rebuild with Kubernetes 1.20.0
- Hostpath volumes for Ceph home directories, shared and dataset storage, and local node data.
30.11.2020
- Node Zariel has been added to the cluster.
15.07.2020
- Ceph persistent storage cluster added