[slurm-users] maintenance partitions?

Renfro, Michael Renfro at tntech.edu
Fri Oct 5 07:24:16 MDT 2018


A reservation overlapping with times you have the node in drain?

Drain and reserve:

# scontrol update nodename=node[037] state=drain reason=“testing"
# scontrol create reservation users=renfro reservationname='drain_test' nodes=node[037] starttime=2018-10-05T08:17:00 endtime=2018-10-05T09:00:00
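To double-check that the drain took effect before handing out the reservation, either of these should show the node's state and the reason we set:

# scontrol show node node037
# sinfo -R --nodes=node037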

Users can't allocate anything on the drained node, as expected (hpcshell is just a shell function that runs srun bash with the usual arguments; see the rough sketch after the output below):

[renfro at login ~]$ hpcshell --reservation=drain_test
srun: Required node not available (down, drained or reserved)
srun: job 135579 queued and waiting for resources
^Csrun: Job allocation 135579 has been revoked
srun: Force Terminated job 135579
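
For reference, a rough sketch of that hpcshell function; the exact srun arguments are site-specific, so treat them as placeholders:

hpcshell () {
    # pass any extra options (e.g. --reservation=drain_test) straight through to srun
    srun --pty "$@" bash -i
}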

Resume while reservation is in place:

# scontrol update nodename=node[037] state=resume

Users in the reservation can now allocate the previously-drained node:

[renfro at login ~]$ hpcshell --reservation=drain_test
[renfro at node037 ~]$
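
Once the maintenance jobs check out and the node is ready for general use again, the reservation can be left to expire at its end time, or removed by hand:

# scontrol delete reservationname=drain_test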

-- 
Mike Renfro  / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Oct 5, 2018, at 8:06 AM, Michael Di Domenico <mdidomenico4 at gmail.com> wrote:
> 
> Is anyone on the list using maintenance partitions for broken nodes?
> If so, how are you moving nodes between partitions?
> 
> The situation with my machines at the moment is that we have a steady
> stream of new jobs coming into the queues, but broken nodes as well.
> I'd like to fix those broken nodes and re-add them to a separate
> non-production pool so that user jobs don't match them, but which still
> allows me to run maintenance jobs on the nodes to prove things are
> working before giving them back to the users.
> 
> If I simply mark nodes with downnodes= or scontrol update state=drain,
> Slurm will prevent users from starting new jobs, but won't allow me to
> run jobs on the nodes either.
> 
> Ideally, I'd like to have a prod partition and a maint partition,
> where the maint partition is set to ExclusiveUser and I can set the
> status of a node in the prod partition to drain without affecting the
> node status in the maint partition.  I don't believe I can do this,
> though.  I believe I have to change the slurm.conf and reconfigure to
> add/remove nodes from one partition or the other.
> 
> If anyone has a better solution, I'd like to hear it.
> 
