Difference between revisions of "Cluster:Changelog"

From Collective Computational Unit

Latest revision as of 23:12, 9 February 2022


10.02.2022

  • Zariel has been repaired (twice) and should now be operational again; please test it. Taints are active on the node, so make sure to add tolerations, as shown in the list of compute nodes (Cluster:Compute_nodes).


20.01.2022

  • Two new nodes added (Imp and Dretch). They are somewhat outdated, but should still be fine for testing and less compute-intensive projects.
  • One new, very powerful node added (Asmodeus, 4x A100 @ 80 GB each). It is currently configured with 8 virtual GPUs at 40 GB each, but if you really need the full 80 GB on one GPU, you can contact me.
  • Zariel removed, since it currently has a hardware failure. I am working with NVIDIA support; no ETA at the moment.
  • Some of the more powerful nodes (Asmodeus and Vecna) have now been "tainted", so the pod scheduler will not use them by default. A pod has to explicitly "tolerate" the taint in its configuration before it can be scheduled on these nodes. Please refer to the list of compute nodes for more explanations and examples.
  • Taints will also be added to other nodes, so that by default your pods will only be scheduled on the least powerful nodes in the cluster. Please start updating your pod configurations if you have preferred nodes.
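
As a rough illustration of what "tolerating" a taint means in a pod configuration, the following Python sketch builds a minimal pod manifest with a single toleration and prints it as JSON, which kubectl accepts in place of YAML. The taint key `example.org/gpu-class`, the value `high`, the pod name, and the image are hypothetical placeholders; the actual taints used on Asmodeus and Vecna are documented in the list of compute nodes, not here.

```python
import json


def pod_with_toleration(name, image, taint_key, taint_value, effect="NoSchedule"):
    """Return a minimal Pod manifest (as a dict) that tolerates one taint."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{"name": name, "image": image}],
            # Without a matching toleration, the scheduler skips tainted nodes.
            "tolerations": [
                {
                    "key": taint_key,        # hypothetical key, see lead-in
                    "operator": "Equal",
                    "value": taint_value,    # hypothetical value, see lead-in
                    "effect": effect,
                }
            ],
        },
    }


# All arguments below are placeholders for illustration only.
manifest = pod_with_toleration(
    "gpu-test", "ubuntu:20.04", "example.org/gpu-class", "high"
)
print(json.dumps(manifest, indent=2))
```

Note that a toleration only allows a pod to run on a tainted node; it does not force it there. To pin a pod to a preferred node, a node selector or node affinity is needed in addition.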


28.12.2021

  • Kubernetes version has been updated to 1.23.1. Please update your kubectl accordingly.
  • Pod security infrastructure has been migrated from the deprecated PodSecurityPolicy to OPA/Gatekeeper. No changes on your side should be required if everything was configured as intended, but please inform me if there are things you should be allowed to do and can't, or things you can do that should rather be forbidden.
  • All GPU drivers have been updated to the most recent versions available for the respective machines. You might have to migrate to more recent versions of GPU containers. The GPU driver and CUDA versions of all compute nodes are now shown on the cluster status page.
  • Node Zariel is currently not available: the system update broke something and the node did not boot up. I need physical access to the server room, so the earliest date I can fix it is January 10th. Please be considerate with the number of GPUs you reserve.
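
To judge whether a local kubectl needs updating after the move to 1.23.1, the upstream Kubernetes version skew policy can serve as a guide: kubectl is supported within one minor version (older or newer) of the API server. The sketch below applies that rule to plain version strings; the function names are made up for this example, and it assumes versions of the form `v1.23.1` or `1.23.1`.

```python
def parse_major_minor(version):
    """Extract (major, minor) from a string such as 'v1.23.1' or '1.23.1'."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def kubectl_within_skew(client_version, server_version):
    """True if client and server differ by at most one minor version."""
    c_major, c_minor = parse_major_minor(client_version)
    s_major, s_minor = parse_major_minor(server_version)
    return c_major == s_major and abs(c_minor - s_minor) <= 1


# A kubectl left over from the 1.20.0 cluster is outside the supported skew.
print(kubectl_within_skew("v1.20.0", "1.23.1"))  # False
# A 1.22.x or 1.24.x kubectl would still be within the supported range.
print(kubectl_within_skew("v1.22.4", "1.23.1"))  # True
```

The client version to feed into such a check can be read from the output of `kubectl version --client`.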


01.02.2021

  • Full cluster rebuild with Kubernetes 1.20.0
  • Hostpath volumes for Ceph home directories, shared and dataset storage, and local node data.


30.11.2020

  • Node Zariel has been added to the cluster.


15.07.2020

  • Ceph persistent storage cluster added