[slurm-users] Stopping new jobs but letting old ones end
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Tue Feb 1 07:11:30 UTC 2022
A login node being down doesn't affect Slurm jobs at all (unless you run
slurmctld/slurmdbd on the login node ;-)
To stop new jobs from being scheduled while letting running jobs finish,
mark all partitions down. This is useful when recovering the cluster from a
power or cooling downtime, for example.
I wrote a handy little script, "schedjobs", available from
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs
Running "schedjobs down" loops over all partitions in the cluster and marks
them down. When the cluster is OK again, run "schedjobs up".
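For reference, a minimal sketch of the same idea (not the actual schedjobs
script, just an assumed rough equivalent built on standard sinfo/scontrol
commands):

   #!/bin/bash
   # Usage: partitions.sh down|up
   # Mark every partition DOWN (no new jobs are scheduled) or UP again.
   # Running jobs are not affected; they keep running until they finish.
   state=${1:?usage: $0 down|up}
   for part in $(sinfo --noheader --format=%R); do
       scontrol update PartitionName="$part" State="${state^^}"
   done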
/Ole
On 2/1/22 07:14, Sid Young wrote:
> Brian / Christopher, that looks like a good process. Thanks guys, I will
> do some testing and let you know.
>
> If I mark a partition down and it has running jobs, what happens to those
> jobs? Do they keep running?
>
>
> Sid Young
> W: https://off-grid-engineering.com
> W: (personal) https://sidyoung.com/
> W: (personal) https://z900collector.wordpress.com/
>
>
> On Tue, Feb 1, 2022 at 3:27 PM Brian Andrus <toomuchit at gmail.com> wrote:
>
> One possibility:
>
> Sounds like your concern is folks with interactive jobs from the login
> node that are running under screen/tmux.
>
> That being the case, you need to let the running jobs end while not
> allowing new users to start tmux sessions.
>
> Definitely do 'scontrol update PartitionName=xxxx State=DOWN' for each
> partition. Also:
>
> touch /etc/nologin
>
> That will prevent new logins.
>
> Send a message to all active folks:
>
> wall "system going down at XX:XX, please end your sessions"
>
> Then wait for folks to drain off your login node and do your stuff.
>
> When done, remove the /etc/nologin file and folks will be able to
> login again.
>
> Brian Andrus
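A rough consolidated sketch of the login-node side of the above, assuming
the standard /etc/nologin behaviour (pam_nologin blocks new non-root logins
while the file exists); the time and message text are placeholders:

   touch /etc/nologin   # block new logins (root can still get in)
   wall "system going down at XX:XX, please end your sessions"

   # ... wait for sessions to drain, then do the maintenance ...

   rm -f /etc/nologin   # allow logins again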
>
> On 1/31/2022 9:18 PM, Sid Young wrote:
>>
>>
>>
>> Sid Young
>> W: https://off-grid-engineering.com
>> W: (personal) https://sidyoung.com/
>> W: (personal) https://z900collector.wordpress.com/
>>
>>
>> On Tue, Feb 1, 2022 at 3:02 PM Christopher Samuel <chris at csamuel.org> wrote:
>>
>> On 1/31/22 4:41 pm, Sid Young wrote:
>>
>> > I need to replace a faulty DIMM chip in our login node so I need to
>> > stop new jobs being kicked off while letting the old ones end.
>> >
>> > I thought I would just set all nodes to drain to stop new jobs from
>> > being kicked off...
>>
>> That would basically be the way, but is there any reason why compute
>> jobs shouldn't start whilst the login node is down?
>>
>>
>> My concern was to keep the running jobs going and stop new ones, so
>> that when the last running job ends I can reboot the login node,
>> knowing that any terminal-window "screen"/"tmux" sessions will
>> effectively have ended along with the job(s).
>>
>> I'm not sure if there is an accepted procedure or best-practice way
>> to tackle shutting down the login node for this use case.
>>
>> On the bright side I am down to two jobs left so any day now :)