[slurm-users] Rolling reboot with at most N machines down simultaneously?

Thu Aug 4 11:53:34 UTC 2022

...thinking about this some more, job dependencies are also an option.
You could carve the nodes up into X 'sets' of N nodes each, with
node-specific reboot jobs that each depend on the job in the same slot
of the previous set finishing. That way at most N nodes are rebooting
at any one time.
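
A minimal sketch of the dependency idea, in Python, which only prints
the sbatch commands rather than running them. It assumes a hypothetical
job script reboot.sh that does the actual reboot, and made-up node
names. The sbatch options used (--parsable, --nodelist,
--dependency=afterany:) are standard ones, but do check them against
your Slurm version.

```python
from typing import List


def chain_lanes(nodes: List[str], n: int) -> List[List[str]]:
    """Split nodes into at most n lanes; lane i holds slot i of every set of n."""
    return [nodes[i::n] for i in range(n) if nodes[i::n]]


def dependency_script(nodes: List[str], n: int, job_script: str = "reboot.sh") -> str:
    """Return a bash script that submits one reboot job per node.

    Within a lane, each job waits (afterany) for the previous one, so no
    more than n reboot jobs are eligible to run at the same time.
    """
    lines = ["#!/bin/bash", "set -euo pipefail"]
    for lane in chain_lanes(nodes, n):
        for pos, node in enumerate(lane):
            # The first job in a lane has no dependency; later ones chain on $prev.
            dep = "" if pos == 0 else " --dependency=afterany:$prev"
            # --parsable makes sbatch print the job ID (plus the cluster name
            # on multi-cluster setups), which is captured in $prev.
            lines.append(
                f"prev=$(sbatch --parsable --nodelist={node}{dep} {job_script})"
            )
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical node names: 5 nodes, at most 2 down at once.
    print(dependency_script(["node01", "node02", "node03", "node04", "node05"], 2))
```

The ordering of the node list decides which nodes share a lane, so
constraints like "always keep GPUs of type X available" would have to
be handled by ordering the list accordingly.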

Tina

On 04/08/2022 11:23, Tina Friedrich wrote:
> I'm thinking something like that currently - setting up some kind of 
> TRES resource that limits how many are rebooted at any one time.
> 
> I usually do this sort of thing more or less manually; as in, I 
> generated a list of sbatch commands with the reboot job (one job per 
> node, specifying node name) - ordered to my liking (making sure I always 
> have GPUs of type X available, that sort of thing) - and then submitted 
> that in batches, waiting for one batch to finish before the next goes in.
> 
> Tina
> 
> On 04/08/2022 06:20, Gerhard Strangar wrote:
>> Phil Chiu wrote:
>>
>>>     - Individual slurm jobs which reboot nodes - With a for loop, I could
>>>     submit a reboot job for each node. But I'm not sure how to limit this so at
>>>     most N jobs are running simultaneously.
>>
>> With a fake license called reboot?
>>
> 
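
For the fake-license suggestion quoted above, a rough sketch of what
that might look like, again in Python and again only generating text.
It assumes a license named reboot defined via the Licenses option in
slurm.conf, plus a hypothetical reboot.sh job script and made-up node
names. Each job requests one license, so the scheduler should start at
most N of them at once.

```python
from typing import List


def slurm_conf_line(n: int, license_name: str = "reboot") -> str:
    """slurm.conf entry defining n cluster-wide licenses with the given name."""
    return f"Licenses={license_name}:{n}"


def license_commands(
    nodes: List[str],
    license_name: str = "reboot",
    job_script: str = "reboot.sh",
) -> List[str]:
    """One sbatch command per node, each requesting a single license."""
    return [
        f"sbatch --nodelist={node} --licenses={license_name}:1 {job_script}"
        for node in nodes
    ]


if __name__ == "__main__":
    # Hypothetical: allow at most 4 concurrent reboot jobs.
    print(slurm_conf_line(4))
    for cmd in license_commands(["node01", "node02", "node03"]):
        print(cmd)
```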

-- 
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk