[slurm-users] Rolling reboot with at most N machines down simultaneously?
Tina Friedrich
tina.friedrich at it.ox.ac.uk
Thu Aug 4 10:23:12 UTC 2022
I'm currently thinking along those lines - setting up some kind of
TRES resource that limits how many nodes are rebooted at any one time.
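As a rough illustration of the fake-license idea from the quoted message, here is a minimal sketch. It assumes slurm.conf defines a local license, for example `Licenses=reboot:2` (the name `reboot` and the count are only illustrative), and that `reboot_node.sh` is a hypothetical job script that does the actual reboot. Each job asks for one license, so the scheduler should not run more reboot jobs at once than there are licenses.

```python
import shlex


def license_limited_commands(nodes, script="reboot_node.sh", license_name="reboot"):
    """Build one sbatch command per node; each job requests one license.

    `script` and `license_name` are hypothetical placeholders.
    """
    return [
        ["sbatch", f"--nodelist={node}", f"--licenses={license_name}:1", script]
        for node in nodes
    ]


if __name__ == "__main__":
    # Hypothetical node names; this only prints the commands, it submits nothing.
    for cmd in license_limited_commands(["node01", "node02", "node03"]):
        print(shlex.join(cmd))
```

Note that a license is only held while the job is running, so this limits the number of concurrent reboot jobs; whether that matches "nodes down at the same time" depends on how the reboot job itself behaves.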
I usually do this sort of thing more or less manually; as in, I
generate a list of sbatch commands for the reboot job (one job per
node, specifying the node name), ordered to my liking (making sure I
always have GPUs of type X available, that sort of thing), and then
submit that in batches, waiting for one batch to finish before the
next goes in.
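The manual procedure above could be scripted roughly as follows. This is only a sketch under assumptions: `reboot_node.sh` is a hypothetical job script, the node names are made up, and `sbatch --wait` is used so that each sbatch call blocks until its job has terminated.

```python
import subprocess


def reboot_commands(nodes, script="reboot_node.sh"):
    """One sbatch command per node, in the order given.

    --wait makes sbatch block until the submitted job terminates.
    `script` is a hypothetical placeholder for the reboot job.
    """
    return [["sbatch", "--wait", f"--nodelist={node}", script] for node in nodes]


def in_batches(items, size):
    """Split items into consecutive batches of at most `size` entries."""
    if size < 1:
        raise ValueError("size must be >= 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def submit_in_batches(nodes, size):
    """Submit one batch, wait for all of its jobs to end, then do the next."""
    for batch in in_batches(reboot_commands(nodes), size):
        procs = [subprocess.Popen(cmd) for cmd in batch]
        for proc in procs:
            proc.wait()


if __name__ == "__main__":
    # Hypothetical node list, ordered so that GPUs of each type stay available.
    ordered = ["gpu-a1", "gpu-b1", "gpu-a2", "gpu-b2", "cpu01"]
    # Only print the batches here; call submit_in_batches(ordered, 2) to submit.
    for batch in in_batches(reboot_commands(ordered), 2):
        print(batch)
```

The ordering of the node list is what keeps particular resources available, so that part stays a manual decision.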
Tina
On 04/08/2022 06:20, Gerhard Strangar wrote:
> Phil Chiu wrote:
>
>> - Individual slurm jobs which reboot nodes - With a for loop, I could
>> submit a reboot job for each node. But I'm not sure how to limit this so at
>> most N jobs are running simultaneously.
>
> With a fake license called reboot?
>
--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator
Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk