[slurm-users] slurmd and dynamic nodes
Brian Andrus
toomuchit at gmail.com
Fri Sep 23 16:24:26 UTC 2022
You shouldn't have to change any parameters if you have them configured in
the defaults file. Just systemctl stop/start slurmd as needed.
Something like:
scontrol update nodename=<node_to_change> state=drain reason="MIG reconfig"
<wait for it to be drained>
ssh <node_to_change> "systemctl stop slurmd"
<run reconfig stuff>
ssh <node_to_change> "systemctl start slurmd"
Not sure what would make you feel slurmd cannot run as a service on a
dynamic node. As long as you added the options to the systemd defaults
file for it, you should be fine (usually /etc/default/slurmd on
Debian-style systems, or /etc/sysconfig/slurmd on RHEL-style ones).
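For example, a minimal defaults file, assuming the stock slurmd.service
reads SLURMD_OPTIONS from it (the Gres value is only an illustration of
one MIG layout, not something slurm requires):

# /etc/default/slurmd  (/etc/sysconfig/slurmd on RHEL-style systems)
# -Z registers the node dynamically; --conf supplies its current config
SLURMD_OPTIONS="-Z --conf=Gres=gpu:2"

With that in place, a MIG reconfig only needs to edit this file between
the systemctl stop and start, and slurmd re-registers with the new config.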
Brian
On 9/23/2022 7:40 AM, Groner, Rob wrote:
> Ya, we're still working out the mechanism for taking the node out,
> making the changes, and bringing it back. But the part I can't figure
> out is slurmd running on the remote node. What do I do with it? Do I
> run it standalone, and when I need to reconfigure, I kill -9 it and
> execute it again with the new configuration? Or what if slurmd is
> running as a service (as it does on all our non-dynamic nodes)? Do I
> stop it, change its service parameters and then restart it to
> reconfigure the node? The docs on slurm for dynamic nodes don't give
> any indication of how you handle slurmd running on the dynamic node.
> What is the preferred method?
>
> Rob
>
> ------------------------------------------------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf
> of Brian Andrus <toomuchit at gmail.com>
> *Sent:* Friday, September 23, 2022 10:24 AM
> *To:* slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] slurmd and dynamic nodes
>
>
> Just off the top of my head here.
>
> I would expect you need to have no jobs currently running on the node,
> so you could submit a job to the node that sets the node to drain,
> does any local steps needed, then exits. In the EpilogSlurmctld
> script, you could then check for drained nodes with a particular
> reason (like 'MIG reconfig'), do the head node steps there, and
> finish by bringing the node back online.
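>
> A minimal sketch of that epilog check (the reason string must match
> whatever you set at drain time; the details here are assumptions, not
> slurm requirements):
>
> #!/bin/bash
> # Hypothetical EpilogSlurmctld sketch: after each job, find nodes that
> # were drained with our agreed-on reason and finish the reconfiguration.
> # sinfo %E prints the drain reason; match on its prefix since slurm may
> # append "[user@timestamp]" to it.
> sinfo -h -t drained -o '%n|%E' | awk -F'|' '$2 ~ /^MIG reconfig/ {print $1}' |
> while read -r node; do
>     # <site-specific MIG reconfig steps for "$node" go here>
>     scontrol update nodename="$node" state=resume
> done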
>
>
> Or just do all those steps from a script outside slurm itself, on the
> head node. You can use ssh/pdsh to connect to a node and execute
> things there while it is out of the mix.
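>
> A sketch of that head-node script (node name passed as an argument;
> the poll interval and the MIG step are placeholders for your site's
> own choices):
>
> #!/bin/bash
> # Hypothetical wrapper run on the head node: drain, reconfigure, resume.
> node="$1"
> scontrol update nodename="$node" state=drain reason="MIG reconfig"
> # Poll until the node is fully drained (no jobs left on it)
> while ! sinfo -h -n "$node" -t drained -o '%n' | grep -q .; do
>     sleep 30
> done
> ssh "$node" 'systemctl stop slurmd'
> # <site-specific nvidia-smi MIG commands for "$node" go here>
> ssh "$node" 'systemctl start slurmd'
> scontrol update nodename="$node" state=resume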
>
>
> Brian Andrus
>
>
> On 9/23/2022 7:09 AM, Groner, Rob wrote:
>>
>> I'm working through how to use the new dynamic node features in order
>> to take down a particular node, reconfigure it (using NVIDIA MIG to
>> change the number of GPU instances available) and give it back to slurm.
>>
>> I'm at the point where I can take a node out of slurm's control from
>> the master node (scontrol delete nodename....), make the nvidia-smi
>> change, and then execute slurmd on the node with the changed
>> configuration parameters. It then does show up again in the sinfo
>> output on the master node, with the correct new resources.
>>
>> What I'm not sure about is...when I want to reconfigure the dynamic
>> node AGAIN, how do I do that on the target node? I can use "scontrol
>> delete" again on the scheduler node, but on the dynamic node, slurmd
>> will still be running. Currently, for testing purposes, I just find
>> the process ID and kill -9 it. Then I change the node configuration
>> and execute "slurmd -Z --conf=...." again.
>>
>> Is there a more elegant way to change the configuration on the
>> dynamic node than by killing the existing slurmd process and starting
>> it again?
>>
>> I'll note that I tried doing everything from the master (slurmctld)
>> node, since there is an option of creating the node there with
>> "scontrol create" instead of using slurmd on the dynamic node. But
>> when I tried that, the dynamic node I created showed up in sinfo
>> output with a ~ next to it (powered off). The online dynamic node
>> docs did not say how, if at all, slurmd is supposed to run on the
>> dynamic node when delete and create are handled only on the master
>> node.
>>
>> Thanks.
>>
>> Rob
>>