[slurm-users] maintenance partitions?
Jeffrey Frey
frey at udel.edu
Fri Oct 5 07:14:33 MDT 2018
You could reconfigure the partition node lists on the fly using scontrol:
$ scontrol update PartitionName=regular_part1 Nodes=<node list minus r00n00>
:
$ scontrol update PartitionName=regular_partN Nodes=<node list minus r00n00>
$ scontrol update PartitionName=maint Nodes=r00n00
It should be easy enough to write a script that finds the partitions containing node X, removes the node from each of them, then adds it to the "maint" partition. The problem is restoring the node to service, since you can't simply disable/down one particular node-in-a-partition.
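As a rough illustration of such a script, here is a minimal Python sketch. It only builds the scontrol command lines and does not run them. It assumes the caller has already collected each partition's expanded node list (for example from `scontrol show partition` output) into a dict; the function name `plan_move_to_maint`, the dict layout, and the choice to refuse to empty a partition are all assumptions made for the example, not anything Slurm provides.

```python
def plan_move_to_maint(partitions, node, maint="maint"):
    """Build scontrol commands that pull `node` out of every partition
    containing it and add it to the maintenance partition.

    `partitions` maps partition name -> list of expanded node names
    (hypothetical input format; gathering it is left to the caller).
    Returns (commands, removed_from), where removed_from lists the
    partitions the node was taken out of.
    """
    commands = []
    removed_from = []
    for name, nodes in partitions.items():
        if name == maint or node not in nodes:
            continue
        remaining = [n for n in nodes if n != node]
        if not remaining:
            # Assumption for this sketch: never leave a partition empty.
            raise ValueError(f"removing {node} would empty partition {name}")
        commands.append(
            f"scontrol update PartitionName={name} Nodes={','.join(remaining)}"
        )
        removed_from.append(name)

    # Keep any nodes already in the maintenance partition and append this one.
    maint_nodes = list(partitions.get(maint, []))
    if node not in maint_nodes:
        maint_nodes.append(node)
    commands.append(
        f"scontrol update PartitionName={maint} Nodes={','.join(maint_nodes)}"
    )
    return commands, removed_from


if __name__ == "__main__":
    # Made-up example data, not taken from a real cluster.
    example = {
        "regular_part1": ["r00n00", "r00n01"],
        "regular_part2": ["r00n01", "r00n02"],
        "maint": [],
    }
    cmds, removed = plan_move_to_maint(example, "r00n00")
    for cmd in cmds:
        print(cmd)
    print("removed from:", removed)
```

Saving the returned list of partitions somewhere would give a later restore step the information it needs to put the node back where it came from.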
> On Oct 5, 2018, at 9:06 AM, Michael Di Domenico <mdidomenico4 at gmail.com> wrote:
>
> Is anyone on the list using maintenance partitions for broken nodes?
> If so, how are you moving nodes between partitions?
>
> The situation with my machines at the moment is that we have a steady
> stream of new jobs coming into the queues, but broken nodes as well.
> I'd like to fix those broken nodes and re-add them to a separate
> non-production pool so that user jobs don't match them, but I can still
> run maintenance jobs on the nodes to prove things are working before
> giving them back to the users.
>
> If I simply mark nodes with downnodes= or scontrol update state=drain,
> Slurm will prevent users from starting new jobs, but won't allow me to
> run jobs on the nodes.
>
> Ideally, I'd like to have a prod partition and a maint partition,
> where the maint partition is set to exclusiveuser and I can set the
> status of a node in the prod partition to drain without affecting the
> node status in the maint partition. I don't believe I can do this,
> though. I believe I have to change the slurm.conf and reconfigure to
> add/remove nodes from one partition or the other.
>
> If anyone has a better solution, I'd like to hear it.
>
::::::::::::::::::::::::::::::::::::::::::::::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE 19716
Office: (302) 831-6034 Mobile: (302) 419-4976
::::::::::::::::::::::::::::::::::::::::::::::::::::::