[slurm-users] slurmd and dynamic nodes

Brian Andrus toomuchit at gmail.com
Fri Sep 23 16:24:26 UTC 2022


You shouldn't have to change any parameters if you have them configured 
in the defaults file. Just systemctl stop/start slurmd as needed.


something like:

scontrol update state=drain nodename=<node_to_change> reason="MIG reconfig"

<wait for it to be drained>

ssh <node_to_change> "systemctl stop slurmd"

<run reconfig stuff>

ssh <node_to_change> "systemctl start slurmd"
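
To script the "wait for it to be drained" step, one option (a rough 
sketch; adjust the node name and poll interval to taste) is:

while [ "$(sinfo -h -n <node_to_change> -o '%T')" != "drained" ]; do
    sleep 10
done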


Not sure what would make you think slurmd cannot run as a service on a 
dynamic node. As long as you added the options to the systemd defaults 
file for it, you should be fine (usually /etc/default/slurmd).
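
For example, the packaged slurmd.service typically pulls SLURMD_OPTIONS 
from that file, so something like this (the Gres value is just a 
placeholder) keeps the dynamic-node flags across restarts:

SLURMD_OPTIONS="-Z --conf=Gres=gpu:2"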


Brian


On 9/23/2022 7:40 AM, Groner, Rob wrote:
> Ya, we're still working out the mechanism for taking the node out, 
> making the changes, and bringing it back. But the part I can't figure 
> out is slurmd running on the remote node.  What do I do with it?  Do I 
> run it standalone, and when I need to reconfigure, I kill -9 it and 
> execute it again with the new configuration?  Or what if slurmd is 
> running as a service (as it does on all our non-dynamic nodes)?  Do I 
> stop it, change its service parameters and then restart it to 
> reconfigure the node? The Slurm docs on dynamic nodes don't give 
> any indication of how to handle slurmd running on the dynamic node.  
> What is the preferred method?
>
> Rob
>
> ------------------------------------------------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf 
> of Brian Andrus <toomuchit at gmail.com>
> *Sent:* Friday, September 23, 2022 10:24 AM
> *To:* slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] slurmd and dynamic nodes
>
>
>
> Just off the top of my head here.
>
> I would expect you need to have no jobs currently running on the node, 
> so you could submit a job to the node that sets the node to 
> drain, does any local work needed, then exits. As part of the 
> EpilogSlurmctld script, you could check for drained nodes based on 
> some reason (like 'MIG reconfig') and do the head node steps there, 
> with a final bit of bringing it back online.
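>
> A rough sketch of that EpilogSlurmctld check (the reason string and 
> the reconfigure script name are placeholders):
>
> #!/bin/bash
> # Resume any node that was drained for a MIG reconfig
> for node in $(sinfo -h -t drained -o '%n' | sort -u); do
>     if scontrol show node "$node" | grep -q 'Reason=MIG reconfig'; then
>         /usr/local/sbin/mig-reconfig.sh "$node"   # your head-node steps
>         scontrol update nodename="$node" state=resume
>     fi
> done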
>
>
> Or just do all those steps from a script outside Slurm itself, on the 
> head node. You can use ssh/pdsh to connect to a node and execute 
> things there while it is out of the mix.
>
>
> Brian Andrus
>
>
> On 9/23/2022 7:09 AM, Groner, Rob wrote:
>>
>> I'm working through how to use the new dynamic node features in order 
>> to take down a particular node, reconfigure it (using NVIDIA MIG to 
>> change the number of GPU instances available) and give it back to Slurm.
>>
>> I'm at the point where I can take a node out of Slurm's control from 
>> the master node (scontrol delete nodename....), make the nvidia-smi 
>> change, and then execute slurmd on the node with the changed 
>> configuration parameters.  It then does show up again in the sinfo 
>> output on the master node, with the correct new resources.
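>>
>> Concretely, that sequence looks something like this (the MIG profile 
>> IDs and Gres count are just examples):
>>
>> scontrol delete nodename=<node>          # on the master
>> nvidia-smi mig -cgi 19,19 -C             # on the node, as root
>> slurmd -Z --conf="Gres=gpu:2"            # on the node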
>>
>> What I'm not sure about is...when I want to reconfigure the dynamic 
>> node AGAIN, how do I do that on the target node?  I can use "scontrol 
>> delete" again on the scheduler node, but on the dynamic node, slurmd 
>> will still be running. Currently, for testing purposes, I just find 
>> the process ID and kill -9 it.  Then I change the node configuration 
>> and execute "slurmd -Z --conf=...." again.
>>
>> Is there a more elegant way to change the configuration on the 
>> dynamic node than by killing the existing slurmd process and starting 
>> it again?
>>
>> I'll note that I tried doing everything from the master (slurmctld) 
>> node, since there is an option of creating the node there with 
>> "scontrol create" instead of using slurmd on the dynamic node.  But 
>> when I tried that, the dynamic node I created showed up in sinfo 
>> output with a ~ next to it (powered off).  The dynamic node docs page 
>> online did not mention how, if at all, slurmd is supposed to be 
>> running on the dynamic node when handling delete and create only 
>> from the master node.
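>>
>> (For what it's worth, the create form in the docs is along the lines 
>> of "scontrol create nodename=<node> cpus=... state=cloud", with the 
>> node's resources filled in; the ~ apparently stays until a slurmd 
>> actually registers from that node.)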
>>
>> Thanks.
>>
>> Rob
>>