[slurm-users] Effect of slurmctld and slurmdb going down on running/pending jobs

Thu Jun 24 09:26:24 UTC 2021

I thought setting partitions to DOWN will kill jobs?

Amjad - in my experience, the slurmdbd & slurmctld server can be 
rebooted with no effect on running jobs. You can't submit whilst it's 
down, and I'm not precisely sure what happens to jobs that are just 
finishing - but really the impact should be minimal.

(I've done exactly what you need to do - reboot so a change in 
disk size is picked up - at least once with the cluster running.)

It is absolutely safe to restart slurmctld (and slurmdbd) with jobs 
running on the cluster; that really is something that I, at least, do 
all the time.
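
For what it's worth, here is a minimal sketch of the "set partitions 
DOWN, reboot, set them UP again" sequence suggested further down the 
thread. It only builds and prints the `scontrol update` commands 
unless you turn the dry run off. The partition names and the two 
helper functions are made up for illustration and are not part of 
Slurm. As far as I can tell from the scontrol man page, a partition 
in state DOWN still accepts submissions but starts no queued jobs, 
and jobs already running carry on.

```python
import shlex
import subprocess


def partition_state_commands(partitions, state):
    """Build one `scontrol update` command per partition (nothing is executed)."""
    if state not in ("UP", "DOWN"):
        raise ValueError(f"unsupported state: {state}")
    return [
        ["scontrol", "update", f"PartitionName={name}", f"State={state}"]
        for name in partitions
    ]


def run(commands, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Hypothetical partition names; on a real cluster take them from `sinfo -h -o %R`.
partitions = ["compute", "gpu"]

run(partition_state_commands(partitions, "DOWN"))  # before the reboot
# ... reboot the slurmctld/slurmdbd/MariaDB VM here ...
run(partition_state_commands(partitions, "UP"))    # once the services are back
```

With the dry run left on, this just prints the four commands so you 
can check them first. After the reboot, `scontrol ping` should tell 
you whether slurmctld is answering again before you set the 
partitions back to UP.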

Tina

On 24/06/2021 10:16, Josef Dvoracek wrote:
> hi,
> 
> just set the partitions to "DOWN" to avoid unexpected behavior for users 
> and reboot the slurm(ctl|dbd)+sql box. In my experience, running jobs 
> are not affected.
> No need to drain nodes.
> 
> josef
> 
> On 24. 06. 21 0:54, Amjad Syed wrote:
>> Hello all
>> We have a cluster running CentOS 7. Our Slurm scheduler is 
>> running on a VM and we are running out of disk space for /var.
>> The Slurm InnoDB database is taking up most of the space. We intend to 
>> expand the vdisk for the Slurm server. This will require a reboot for 
>> the changes to take effect. Do we have to stop users submitting jobs by 
>> draining all partitions and then restart the server, that is slurmctld, 
>> slurmdbd and mariadb? Or will restarting the Slurm VM have no effect on 
>> running/pending jobs?
>>
>> Sincerely
>>
>> Amjad
> 

-- 
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk