[slurm-users] Compute node OS and firmware updates

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Thu Aug 6 18:40:31 UTC 2020


Regarding methods for updating the OS and firmware on Slurm compute
nodes: we have long used rolling updates while the cluster is in full
production, so that no resources are wasted.  When an entire partition
is updated in this way, there is no risk of new jobs starting on a mix
of updated and not-yet-updated nodes, while running jobs continue on
the not-yet-updated nodes.

The basic idea (which was provided by Niels Carl Hansen, ncwh -at- 
cscaa.dk) is to install a crontab job that runs a script, "update.sh",
whenever a node is rebooted.  Then use scontrol to reboot the nodes as
they become idle, so that the updates you want are performed when each
node boots.  The update.sh script should remove the crontab job as part
of its work, so that later reboots do not repeat the update.
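
As a rough illustration of the idea above (this is not the update.sh
from the repository linked in this post), the pieces could look
something like the sketch below.  The install path /root/update.sh,
the function names and the reason string are assumptions made for
this example only.  The functions just print the crontab entry and
the scontrol command, so nothing is changed until you use the output
yourself.

```shell
#!/bin/bash
# Hypothetical sketch of the rolling-update workflow; names and paths
# are illustrative assumptions, not taken from the real update.sh.

# Assumed location of the update script on each compute node.
UPDATE_SCRIPT="${UPDATE_SCRIPT:-/root/update.sh}"

# Print a crontab entry that runs the update script at boot.
# "@reboot" is a cron extension (available in cronie and Vixie cron).
cron_entry() {
    printf '@reboot %s\n' "$UPDATE_SCRIPT"
}

# Print the Slurm command that reboots nodes as they become idle.
# $1 = node list, $2 = reason (a single word here, to avoid quoting).
# ASAP drains the nodes so no new jobs start on them before the reboot;
# nextstate=RESUME returns them to service afterwards.
reboot_command() {
    printf 'scontrol reboot ASAP nextstate=RESUME reason=%s %s\n' "$2" "$1"
}

# Filter a crontab on stdin, dropping the entry for the update script.
# update.sh itself could use this to remove its own crontab job.
strip_cron_entry() {
    grep -v -F -- "$UPDATE_SCRIPT" || true
}
```

On a node, the entry might be added with
"(crontab -l 2>/dev/null; cron_entry) | crontab -", and update.sh
could end with "crontab -l | strip_cron_entry | crontab -".  The
command printed by reboot_command would be run by a Slurm
administrator, for example for the node list a[001-004].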

The update.sh script and instructions for usage are in:
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/nodes

Comments are welcome.

/Ole

-- 
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark


