[slurm-users] maintenance partitions?
Renfro, Michael
Renfro at tntech.edu
Fri Oct 5 07:24:16 MDT 2018
A reservation overlapping with times you have the node in drain?
Drain and reserve:
# scontrol update nodename=node[037] state=drain reason="testing"
# scontrol create reservation users=renfro reservationname='drain_test' nodes=node[037] starttime=2018-10-05T08:17:00 endtime=2018-10-05T09:00:00
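If you want to double-check what got created before pointing users at it, scontrol can show the reservation (exact output fields vary by Slurm version, so I'm only showing the command, not a transcript):

# scontrol show reservation drain_test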
Users can't allocate anything on the drained node (as expected; hpcshell is just a shell function that wraps srun to start an interactive bash session with the usual arguments):
[renfro at login ~]$ hpcshell --reservation=drain_test
srun: Required node not available (down, drained or reserved)
srun: job 135579 queued and waiting for resources
^Csrun: Job allocation 135579 has been revoked
srun: Force Terminated job 135579
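For reference, a minimal sketch of what a wrapper like hpcshell can look like; the partition and time limit here are placeholders, not what we actually run:

hpcshell () {
    # Placeholder partition/time values; extra arguments such as
    # --reservation=drain_test are passed straight through to srun.
    srun --partition=interactive --time=2:00:00 "$@" --pty bash -i
}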
Resume while reservation is in place:
# scontrol update nodename=node[037] state=resume
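After that, sinfo should show the node back in a normal state instead of drained (just a sanity check, not required):

# sinfo -n node037 -o "%N %T"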
Users in reservation can allocate using the previously-drained node:
[renfro at login ~]$ hpcshell --reservation=drain_test
[renfro at node037 ~]$
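Once you're satisfied the node is healthy and have returned it to general use, the reservation can be deleted so it doesn't hang around:

# scontrol delete reservationname=drain_test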
--
Mike Renfro / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University
> On Oct 5, 2018, at 8:06 AM, Michael Di Domenico <mdidomenico4 at gmail.com> wrote:
>
> Is anyone on the list using maintenance partitions for broken nodes?
> If so, how are you moving nodes between partitions?
>
> The situation with my machines at the moment is that we have a steady
> stream of new jobs coming into the queues, but broken nodes as well.
> I'd like to fix those broken nodes and re-add them to a separate
> non-production pool so that user jobs don't match, but still allow me
> to run maintenance jobs on the nodes to prove things are working
> before giving them back to the users.
>
> If I simply mark nodes with downnodes= or scontrol update state=drain,
> Slurm will prevent users from starting new jobs, but it won't allow me
> to run jobs on the nodes either.
>
> Ideally, I'd like to have a prod partition and a maint partition,
> where the maint partition is set to exclusiveuser and I can set the
> status of a node in the prod partition to drain without affecting the
> node status in the maint partition. I don't believe I can do this,
> though. I believe I have to change slurm.conf and reconfigure to
> add/remove nodes from one partition or the other.
>
> If anyone has a better solution, I'd like to hear it.
>