[slurm-users] Stopping new jobs but letting old ones end
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Tue Feb 1 07:19:30 UTC 2022
One thing to be aware of when setting partition states to down:
* A partition state set to down will be reset if slurmctld is restarted.
Read the slurmctld man-page under the -R parameter. So it's better not to
restart slurmctld during the downtime.
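
If you do have to restart slurmctld, you can check the partition states
afterwards and re-apply the down state where needed, for example (a
minimal sketch; "xxxx" stands for a partition name):

  # Show each partition and its availability (up/down)
  sinfo -o "%P %a"
  # Re-mark a partition down if the restart reset it
  scontrol update PartitionName=xxxx State=DOWN
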
/Ole
On 2/1/22 08:11, Ole Holm Nielsen wrote:
> Login nodes being down doesn't affect Slurm jobs at all (except if you run
> slurmctld/slurmdbd on the login node ;-)
>
> To stop new jobs from being scheduled for running, mark all partitions
> down. This is useful when recovering the cluster from a power or cooling
> downtime, for example.
>
> I wrote a handy little script, "schedjobs", available from
> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs
> Running "schedjobs down" loops over all partitions in the cluster and
> marks them down. When the cluster is OK again, run "schedjobs up".
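>
> Roughly speaking it does something like this (a simplified sketch, not
> the actual script; it just loops over the partition names that sinfo
> reports):
>
>   #!/bin/bash
>   # Mark every partition down; use State=UP instead to re-enable them
>   for part in $(sinfo --noheader -o "%R" | sort -u); do
>       scontrol update PartitionName="$part" State=DOWN
>   done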
>
> /Ole
>
> On 2/1/22 07:14, Sid Young wrote:
>> Brian / Christopher, that looks like a good process. Thanks guys, I will
>> do some testing and let you know.
>>
>> If I mark a partition down and it has running jobs, what happens to
>> those jobs? Do they keep running?
>>
>>
>> Sid Young
>> W: https://off-grid-engineering.com
>> W: (personal) https://sidyoung.com/
>> W: (personal) https://z900collector.wordpress.com/
>>
>>
>> On Tue, Feb 1, 2022 at 3:27 PM Brian Andrus <toomuchit at gmail.com> wrote:
>>
>> One possibility:
>>
>> Sounds like your concern is folks with interactive jobs from the login
>> node that are running under screen/tmux.
>>
>> That being the case, you need to let the running jobs end while not
>> allowing new users to start tmux sessions.
>>
>> Definitely do 'scontrol update state=down partition=xxxx' for each
>> partition. Also:
>>
>> touch /etc/nologin
>>
>> That will prevent new logins.
>>
>> Send a message to all active folks:
>>
>> wall "system going down at XX:XX, please end your sessions"
>>
>> Then wait for folks to drain off your login node and do your stuff.
>>
>> When done, remove the /etc/nologin file and folks will be able to
>> login again.
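>>
>> Putting the pieces together, the whole maintenance window might look
>> roughly like this (a sketch only; "xxxx" and "XX:XX" are placeholders
>> as above):
>>
>>   # Stop new jobs from being scheduled (repeat for each partition)
>>   scontrol update PartitionName=xxxx State=DOWN
>>   # Block new logins and warn the users who are already logged in
>>   touch /etc/nologin
>>   wall "system going down at XX:XX, please end your sessions"
>>   # ... do the maintenance, then restore everything ...
>>   rm /etc/nologin
>>   scontrol update PartitionName=xxxx State=UP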
>>
>> Brian Andrus
>>
>> On 1/31/2022 9:18 PM, Sid Young wrote:
>>>
>>>
>>>
>>> Sid Young
>>> W: https://off-grid-engineering.com
>>> W: (personal) https://sidyoung.com/
>>> W: (personal) https://z900collector.wordpress.com/
>>>
>>>
>>> On Tue, Feb 1, 2022 at 3:02 PM Christopher Samuel <chris at csamuel.org> wrote:
>>>
>>> On 1/31/22 4:41 pm, Sid Young wrote:
>>>
>>> > I need to replace a faulty DIMM chip in our login node so I need to
>>> > stop new jobs being kicked off while letting the old ones end.
>>> >
>>> > I thought I would just set all nodes to drain to stop new jobs from
>>> > being kicked off...
>>>
>>> That would basically be the way, but is there any reason why compute
>>> jobs shouldn't start whilst the login node is down?
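>>>
>>> For the record, draining all of the nodes would look something like
>>> this (a sketch; the node list is just an example):
>>>
>>>   # Drain: running jobs finish, but no new jobs start on these nodes
>>>   scontrol update NodeName=node[001-099] State=DRAIN Reason="login node maintenance"
>>>   # Return them to service afterwards
>>>   scontrol update NodeName=node[001-099] State=RESUME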
>>>
>>>
>>> My concern was to keep the running jobs going while stopping new ones,
>>> so that when the last running job ended, I could reboot the login node
>>> knowing that any terminal-window "screen"/"tmux" sessions would
>>> effectively have ended along with the jobs.
>>>
>>> I'm not sure if there is an accepted procedure or best-practice way
>>> to tackle shutting down the login node for this use case.
>>>
>>> On the bright side I am down to two jobs left so any day now :)