[slurm-users] SLURM Elastic Compute - Unable to determine this node's NodeName
Felix Wolfheimer
f.wolfheimer at googlemail.com
Sat Jul 21 13:49:32 MDT 2018
I tried a bit more and found the following solution, which works fine for
me. When creating a new instance, I pass a small script as userdata, which
is executed automatically on the new node as part of the provisioning step.
The script adds "-N <node-name>", with the node name requested by
slurmctld, to the slurmd command line on the node.
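Below is a minimal Python sketch of how a ResumeProgram might build that userdata script. It assumes slurmd lives at /usr/sbin/slurmd on the AMI and is not already started by an init system there. The path and the function name make_userdata are hypothetical. Only the idea of passing -N <node-name> to slurmd comes from the message above. Launching the instance itself (for example through the EC2 API) is left out.

```python
import shlex

# Hypothetical location of slurmd on the AMI; adjust as needed.
SLURMD = "/usr/sbin/slurmd"


def make_userdata(node_name: str) -> str:
    """Return a shell script to pass as instance userdata.

    The script starts slurmd with -N so that it registers under the
    NodeName that slurmctld requested, rather than under the DHCP
    hostname (ip-10-0-1-x) the instance received.
    """
    # shlex.quote guards against unexpected characters in the name.
    return "#!/bin/bash\n" f"{SLURMD} -N {shlex.quote(node_name)}\n"


if __name__ == "__main__":
    print(make_userdata("compute-1-7"))
```

One script per instance keeps the node name explicit, so the instance's DHCP hostname never has to match slurm.conf.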
---------- Forwarded message ---------
From: Felix Wolfheimer <f.wolfheimer at googlemail.com>
Date: Fr., 20. Juli 2018, 23:11
Subject: SLURM Elastic Compute - Unable to determine this node's NodeName
To: <slurm-users at schedmd.com>
Hi,
I'm trying to configure a cluster on AWS that scales automatically using
SLURM's Elastic Compute (https://slurm.schedmd.com/elastic_computing.html).
However, I can't figure out how the nodes are supposed to be registered
with SLURM.
I have a simple setup in my slurm.conf (shared by all nodes). Only the
relevant part is shown here:
# AUTOSCALING
ResumeProgram=/usr/local/sbin/virtual-cluster-scale-up
SuspendProgram=/usr/local/sbin/virtual-cluster-scale-down
SuspendTime=900
ResumeTimeout=120
SuspendTimeout=300
BatchStartTimeout=120
ResumeRate=10
SuspendRate=10
TreeWidth=24000
NodeName=compute-1-[1-254] CPUs=8 State=CLOUD
PartitionName=compute-1 Nodes=compute-1-[1-254] MaxTime=INFINITE State=UP
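Slurm calls ResumeProgram with the names of the nodes to power up, written in its hostlist notation. This is the same notation used in NodeName=compute-1-[1-254] above, so the program usually has to expand it before launching instances. The Python sketch below handles only a simplified form: a plain name, or one trailing bracket group of numbers and ranges. The function name expand_hostlist is hypothetical. On a real cluster, `scontrol show hostnames <hostlist>` performs the complete expansion.

```python
import re


def expand_hostlist(expr: str) -> list[str]:
    """Expand a simple hostlist such as 'compute-1-[1-3,7]'.

    Simplified on purpose: supports a single bracket group at the end
    of the name, containing comma-separated numbers and lo-hi ranges.
    Zero padding of the lower bound (e.g. '01-10') is preserved.
    """
    m = re.fullmatch(r"(.*)\[([^\]]+)\]", expr)
    if m is None:
        return [expr]  # plain node name, nothing to expand
    prefix, body = m.groups()
    names = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-", 1)
            width = len(lo)
            for i in range(int(lo), int(hi) + 1):
                names.append(f"{prefix}{i:0{width}d}")
        else:
            names.append(prefix + part)
    return names


if __name__ == "__main__":
    print(expand_hostlist("compute-1-[1-3,7]"))
```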
The problem that gives me a headache is the following:
The nodes I create from an AMI get the default AWS hostnames via DHCP,
something like ip-10-0-1-x, so the hostname obviously differs from the
NodeName in slurm.conf. When a node boots, it starts slurmd, which finds
that its name "ip-10-0-1-x" is not listed in slurm.conf and refuses to
start ("Unable to determine this node's NodeName"). Of course, on the
master where slurmctld is running, I executed "scontrol update NodeName=...
NodeHostName=... NodeAddr=..." as explained in the documentation, to map
the NodeName to the NodeHostName. But this doesn't seem to influence the
behavior of slurmd.
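The mapping step could be scripted on the slurmctld host as in the Python sketch below. The helper names are hypothetical. The scontrol parameters NodeName, NodeAddr and NodeHostName are the documented ones. Presumably this update changes only slurmctld's view of the node. slurmd on the instance still matches its own hostname against its local slurm.conf, which would explain why the error persists until slurmd is started with -N.

```python
import subprocess


def scontrol_update_cmd(node_name: str, addr: str, hostname: str) -> list[str]:
    """Build the scontrol call mapping a Slurm NodeName to the real instance."""
    return [
        "scontrol", "update",
        f"NodeName={node_name}",
        f"NodeAddr={addr}",
        f"NodeHostName={hostname}",
    ]


def register_node(node_name: str, addr: str, hostname: str) -> None:
    # Intended to run on the slurmctld host; raises CalledProcessError
    # if scontrol exits with a non-zero status.
    subprocess.run(scontrol_update_cmd(node_name, addr, hostname), check=True)


if __name__ == "__main__":
    print(" ".join(scontrol_update_cmd("compute-1-1", "10.0.1.5", "ip-10-0-1-5")))
```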
Should slurmd be started with '-N' on the new node to set the node name
explicitly to the one expected by slurmctld, or is there something else I'm
missing?
Thanks for any help and best regards
Felix