<div dir="auto">After some more experimenting I found a solution that works for me. When creating a new instance, I pass a small script as userdata, which is executed automatically on the new node as part of the provisioning step. The script adds the string "-N &lt;node-name&gt;", with the node name requested by slurmctld, to the slurmd command line on the node.</div><br><div class="gmail_quote"><div dir="ltr">---------- Forwarded message ---------<br>From: <strong class="gmail_sendername" dir="auto">Felix Wolfheimer</strong> <span dir="ltr">&lt;<a href="mailto:f.wolfheimer@googlemail.com">f.wolfheimer@googlemail.com</a>&gt;</span><br>Date: Fri, 20 July 2018, 23:11<br>Subject: SLURM Elastic Compute - Unable to determine this node's NodeName<br>To: &lt;<a href="mailto:slurm-users@schedmd.com">slurm-users@schedmd.com</a>&gt;<br></div><br><br><div dir="ltr">Hi,<div><br></div><div>I'm trying to configure a cluster on AWS which scales automatically using SLURM's Elastic Compute (<a href="https://slurm.schedmd.com/elastic_computing.html" target="_blank" rel="noreferrer">https://slurm.schedmd.com/elastic_computing.html</a>). However, I can't figure out how the nodes are supposed to be registered such that SLURM recognizes them.</div><div><br></div><div>I've a simple setup in my slurm.conf (shared by all nodes). 
Only the relevant part is shown here:</div><div><br></div><div><div># AUTOSCALING</div><div>ResumeProgram=/usr/local/sbin/virtual-cluster-scale-up</div><div>SuspendProgram=/usr/local/sbin/virtual-cluster-scale-down</div><div>SuspendTime=900</div><div>ResumeTimeout=120</div><div>SuspendTimeout=300</div><div>BatchStartTimeout=120</div><div>ResumeRate=10</div><div>SuspendRate=10</div><div>TreeWidth=24000</div></div><div><br></div><div><div>NodeName=compute-1-[1-254] CPUs=8 State=CLOUD</div><div>PartitionName=compute-1 Nodes=compute-1-[1-254] MaxTime=INFINITE State=UP</div></div><div><br></div><div>The problem which gives me a headache is the following:</div><div>The nodes I create from an AMI get the default AWS hostnames via DHCP, something like ip-10-0-1-x. This hostname obviously differs from the NodeName in slurm.conf. When a node boots, it starts slurmd, which finds that its hostname "ip-10-0-1-x" is not mentioned in slurm.conf and refuses to start (Unable to determine this node's NodeName). Of course I executed the command "scontrol update NodeName=... NodeHostName=... NodeAddr=..." on the master, where slurmctld is running, as explained in the documentation, to map the NodeName to the NodeHostName. But this doesn't seem to influence the behavior of slurmd. Should slurmd be started with '-N' on the new node to set the node name explicitly to the one expected by slurmctld, or is there something else I'm missing?</div><div><br></div><div>Thanks for any help and best regards</div><div><br></div><div>Felix </div><div><br></div><div> </div></div>
</div>
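<div dir="auto">For reference, the userdata approach described at the top of this message could look roughly like the sketch below. It is not the exact script I use. It assumes that the ResumeProgram fills in a NODE_NAME variable with the name slurmctld requested (for example compute-1-7) when it launches the instance, and that the slurmd systemd unit reads SLURMD_OPTIONS from /etc/sysconfig/slurmd. The function names, the variable and the path are hypothetical and depend on how your AMI is set up.</div>

```shell
#!/bin/bash
# Hypothetical userdata sketch: make slurmd register under the NodeName
# that slurmctld expects by passing "-N <node-name>" to slurmd.
# Assumption: NODE_NAME is injected by the ResumeProgram at launch time;
# AWS does not set it on its own.

# Print the options line that pins slurmd to the given node name.
slurmd_options_line() {
    printf 'SLURMD_OPTIONS="-N %s"\n' "$1"
}

# Write the options line to the environment file read by the slurmd unit.
# Assumption: the unit passes $SLURMD_OPTIONS to slurmd and loads it from
# /etc/sysconfig/slurmd; adjust the default path for your distribution.
write_slurmd_options() {
    node_name="$1"
    env_file="${2:-/etc/sysconfig/slurmd}"
    slurmd_options_line "$node_name" > "$env_file"
}

# Only touch the system when a node name was actually provided.
if [ -n "${NODE_NAME:-}" ]; then
    write_slurmd_options "$NODE_NAME"
    systemctl restart slurmd
fi
```

<div dir="auto">With NODE_NAME=compute-1-7 this writes SLURMD_OPTIONS="-N compute-1-7" to the environment file and restarts slurmd, so the daemon no longer relies on the DHCP hostname to find its entry in slurm.conf.</div>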