[slurm-users] Compute node OS and firmware updates
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Thu Aug 6 18:40:31 UTC 2020
Regarding methods for updating the OS and firmware on Slurm compute
nodes: we have for a long time used rolling updates while the cluster
is in full production, so that no resources are wasted. When an entire
partition is upgraded in this way, there is no risk of new jobs starting
on nodes with differing OS and firmware states, while running jobs
continue on the not-yet-updated nodes.
The basic idea (which was provided by Niels Carl Hansen, ncwh -at-
cscaa.dk) is to install a crontab job that runs a script "update.sh"
whenever a node is rebooted. Then use scontrol to reboot the nodes as
they become idle, so that each node performs the updates that you want
when it boots. The update.sh script removes the crontab job as part of
its work, so the updates are not repeated at later reboots.
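
As a rough illustration of the three pieces involved (the cron entry, the
reboot request, and the cleanup), here is a minimal sketch. It is not the
update.sh from the repository below. The file locations and function names
are my own assumptions, as is the use of an "@reboot" entry in /etc/cron.d
and of the ASAP and nextstate=RESUME options to "scontrol reboot"; check
both against your cron and Slurm versions.

```shell
#!/bin/bash
# Sketch only: these locations are assumptions for illustration and are
# not taken from the update.sh in the Slurm_tools repository.
CRON_FILE="${CRON_FILE:-/etc/cron.d/node-update}"    # hypothetical cron.d file
UPDATE_SCRIPT="${UPDATE_SCRIPT:-/root/update.sh}"    # hypothetical script path

# Print the cron.d entry that runs the update script at every boot.
cron_entry() {
    printf '@reboot root /bin/bash %s\n' "$UPDATE_SCRIPT"
}

# Print the command that asks Slurm to reboot the given nodes once their
# running jobs have finished, and to return them to service afterwards.
reboot_command() {
    printf 'scontrol reboot ASAP nextstate=RESUME %s\n' "$1"
}

# To be called from the update script: remove the cron entry so that
# later reboots do not run the update again.
remove_cron_entry() {
    rm -f "$CRON_FILE"
}
```

On each node one would write the output of cron_entry to $CRON_FILE and
copy the update script into place, then run the command printed by
reboot_command on a host with Slurm administrator rights, giving it the
node list of the partition. The functions only print the cron entry and
the command, so they can be inspected before anything is changed.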
The update.sh script and instructions for usage are in:
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/nodes
Comments are welcome.
/Ole
--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark