[slurm-users] slurmd and dynamic nodes

Brian Andrus toomuchit at gmail.com
Fri Sep 23 14:24:08 UTC 2022


Just off the top of my head here.

I would expect that the node needs to have no jobs running on it, 
so you could submit a job to the node that sets the node to drain, 
does whatever local work is needed, then exits. As part of the 
EpilogSlurmctld script, you could check for nodes drained with a 
particular reason (like 'MIG reconfig') and do the head node steps 
there, with a final step that brings the node back online.
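
A rough sketch of that epilog-side check, in Python. It is untested against a real cluster. The function names are made up for illustration, and the reason string 'MIG reconfig' is just the example from above. It assumes sinfo and scontrol are on the PATH of the user running slurmctld.

```python
import subprocess


def nodes_drained_for(sinfo_output, reason):
    """Return node names whose drain reason equals `reason`.

    Expects lines of the form 'nodename|reason', as produced by
    sinfo -h -N -t drained -o "%N|%E".
    """
    nodes = []
    for line in sinfo_output.splitlines():
        name, sep, node_reason = line.partition("|")
        name = name.strip()
        # -N can list a node once per partition, so skip repeats.
        if sep and node_reason.strip() == reason and name not in nodes:
            nodes.append(name)
    return nodes


def query_drained():
    """Ask sinfo for drained nodes: no header, one line per node."""
    result = subprocess.run(
        ["sinfo", "-h", "-N", "-t", "drained", "-o", "%N|%E"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def resume(node):
    """Bring a node back online once the head node steps are done."""
    subprocess.run(
        ["scontrol", "update", f"NodeName={node}", "State=RESUME"],
        check=True,
    )
```

The epilog would then loop over nodes_drained_for(query_drained(), "MIG reconfig"), do the head node steps for each node, and call resume(node) at the end.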


Or just do all those steps from a script outside slurm itself, on the 
head node. You can use ssh/pdsh to connect to a node and execute things 
there while it is out of the mix.
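
Something like the following could drive that from the head node. It is only a sketch. It assumes passwordless ssh to the node and that slurmd there runs as a systemd unit (if it was started by hand, swap in whatever stops it cleanly). The MIG command and the --conf string are passed in by the caller, since they depend on the hardware. reconfig_plan and run_plan are made-up names. It also does not wait for running jobs to finish after the drain, which a real script would need to do.

```python
import shlex
import subprocess


def reconfig_plan(node, mig_cmd, slurmd_conf, reason="MIG reconfig"):
    """Return the ordered commands (argv lists) for one reconfiguration."""

    def on_node(argv):
        # ssh takes the remote command as one string, so quote and join it.
        return ["ssh", node, shlex.join(argv)]

    return [
        # Stop new jobs from landing on the node.
        ["scontrol", "update", f"NodeName={node}", "State=DRAIN",
         f"Reason={reason}"],
        # Assumption: slurmd is managed by systemd on the node.
        on_node(["systemctl", "stop", "slurmd"]),
        # Remove the dynamic node from the controller.
        ["scontrol", "delete", f"NodeName={node}"],
        # Caller-supplied reconfiguration command, e.g. an nvidia-smi call.
        on_node(mig_cmd),
        # Re-register the node with its new configuration.
        on_node(["slurmd", "-Z", f"--conf={slurmd_conf}"]),
    ]


def run_plan(plan, dry_run=True):
    """Print each command; execute it as well unless dry_run is set."""
    for argv in plan:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
```

Calling run_plan(reconfig_plan(...)) with the default dry_run=True only prints the commands, which is a cheap way to check the sequence before letting it touch a node.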


Brian Andrus


On 9/23/2022 7:09 AM, Groner, Rob wrote:
>
> I'm working through how to use the new dynamic node features in order 
> to take down a particular node, reconfigure it (using nvidia MIG to 
> change the number of graphic cores available) and give it back to slurm.
>
> I'm at the point where I can take a node out of slurm's control from 
> the master node (scontrol delete nodename....), make the nvidia-smi 
> change, and then execute slurmd on the node with the changed 
> configuration parameters.  It then does show up again in the sinfo 
> output on the master node, with the correct new resources.
>
> What I'm not sure about is...when I want to reconfigure the dynamic 
> node AGAIN, how do I do that on the target node?  I can use "scontrol 
> delete" again on the scheduler node, but on the dynamic node, slurmd 
> will still be running. Currently, for testing purposes, I just find 
> the process ID and kill -9 it.  Then I change the node configuration 
> and execute "slurmd -Z --conf=...." again.
>
> Is there a more elegant way to change the configuration on the dynamic 
> node than by killing the existing slurmd process and starting it again?
>
> I'll note that I tried doing everything from the master (slurmctld) 
> node, since there is an option of creating the node there with 
> "scontrol create" instead of using slurmd on the dynamic node.  But 
> when I tried that, the dynamic node I created showed up in sinfo 
> output with a ~ next to it (powered off).  The dynamic node docs page 
> online did not mention what, if anything, slurmd was supposed to be 
> running as on the dynamic node if attempting to handle delete and 
> create only on the master node.
>
> Thanks.
>
> Rob
>