[slurm-users] SLURM Elastic Compute - Unable to determine this node's NodeName

Felix Wolfheimer f.wolfheimer at googlemail.com
Fri Jul 20 15:11:27 MDT 2018


Hi,

I'm trying to configure a cluster on AWS which scales automatically using
SLURM's Elastic Compute (https://slurm.schedmd.com/elastic_computing.html).
However, I can't figure out how the nodes are supposed to be registered
such that SLURM recognizes them.

I have a simple setup in my slurm.conf (shared by all nodes). Only the
relevant part is shown here:

# AUTOSCALING
ResumeProgram=/usr/local/sbin/virtual-cluster-scale-up
SuspendProgram=/usr/local/sbin/virtual-cluster-scale-down
SuspendTime=900
ResumeTimeout=120
SuspendTimeout=300
BatchStartTimeout=120
ResumeRate=10
SuspendRate=10
TreeWidth=24000

NodeName=compute-1-[1-254] CPUs=8 State=CLOUD
PartitionName=compute-1 Nodes=compute-1-[1-254] MaxTime=INFINITE State=UP
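
For illustration, here is a minimal sketch of how a ResumeProgram such as
the one configured above might interpret its argument. Slurm passes the
nodes to resume as a single hostlist expression (for example
compute-1-[1-3,7]). The function name expand_hostlist is hypothetical, and
the parser only covers a single name or one bracket group without zero
padding; on a real cluster, "scontrol show hostnames <expr>" is the more
robust way to expand the list.

```python
import re


def expand_hostlist(expr):
    """Expand a simple hostlist like 'compute-1-[1-3,7]' into node names.

    Assumption: at most one bracket group and no zero-padded numbers,
    which is enough for the NodeName range in the slurm.conf above.
    """
    match = re.fullmatch(r"(.*?)\[([0-9,\-]+)\](.*)", expr)
    if match is None:
        # No bracket group: treat the argument as a single node name.
        return [expr]
    prefix, ranges, suffix = match.groups()
    names = []
    for part in ranges.split(","):
        # "1-3" is a range; "7" is a single index (hi falls back to lo).
        lo, _, hi = part.partition("-")
        hi = hi or lo
        for index in range(int(lo), int(hi) + 1):
            names.append(f"{prefix}{index}{suffix}")
    return names


if __name__ == "__main__":
    # Example argument as ResumeProgram might receive it.
    for name in expand_hostlist("compute-1-[1-3,7]"):
        print(name)
```

Each expanded name would then be handed to whatever code launches the
corresponding EC2 instance; that part is outside the scope of this sketch.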

The problem which gives me a headache is the following:
The nodes I created from an AMI get the default AWS hostnames via DHCP,
something like ip-10-0-1-x. So obviously this hostname is different from
the NodeName in slurm.conf. Once a node starts up, it starts slurmd, which
finds that its hostname "ip-10-0-1-x" is not mentioned in slurm.conf and
refuses to start (Unable to determine this node's NodeName). Of course I
executed the command "scontrol update NodeName=... NodeHostName=...
NodeAddr=..." on the master, where slurmctld is running, as explained in
the documentation, to map the NodeName to the NodeHostName. But this
doesn't seem to influence the behavior of slurmd. Should slurmd be started
with '-N' on the new node to set the node name explicitly to the one
expected by slurmctld, or is there something else I'm missing?
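
To make the two steps in question concrete, here is a small sketch that
only builds the command lines involved: the "scontrol update" call run on
the master and a "slurmd -N" invocation on the new node. It is not a
confirmed fix, just the option raised above written out. The function
names, the example node compute-1-1, and the address 10.0.1.23 are all
hypothetical, and how the new instance learns which NodeName it was
assigned (for instance, passed in at launch) is left as an assumption.

```python
def addr_from_aws_hostname(hostname):
    """Turn a default AWS hostname such as 'ip-10-0-1-23' into '10.0.1.23'.

    Assumption: the hostname follows the 'ip-a-b-c-d' pattern.
    """
    if not hostname.startswith("ip-"):
        raise ValueError(f"unexpected hostname format: {hostname}")
    return hostname[len("ip-"):].replace("-", ".")


def scontrol_update_cmd(node_name, host_name):
    """Command for the master: map a Slurm NodeName to the real host."""
    # NodeHostname and NodeAddr are the parameter names scontrol accepts.
    return [
        "scontrol", "update",
        f"NodeName={node_name}",
        f"NodeHostname={host_name}",
        f"NodeAddr={addr_from_aws_hostname(host_name)}",
    ]


def slurmd_cmd(node_name):
    """Command for the new node: start slurmd under an explicit NodeName."""
    return ["slurmd", "-N", node_name]


if __name__ == "__main__":
    # Hypothetical pairing of a cloud node name with an AWS hostname.
    print(" ".join(scontrol_update_cmd("compute-1-1", "ip-10-0-1-23")))
    print(" ".join(slurmd_cmd("compute-1-1")))
```

The lists can be passed to subprocess.run on the respective machines. The
point of starting slurmd with -N would be that it no longer has to find
its own hostname in slurm.conf.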

Thanks for any help and best regards

Felix