[slurm-users] Cloud nodes remain in state "alloc#"
Rupert Madden-Abbott
rupert.madden.abbott at gmail.com
Sat Oct 24 22:01:03 UTC 2020
Hi,
I'm using Slurm's elastic compute functionality to spin up nodes in the
cloud, alongside a controller which is also in the cloud.
When executing a job, Slurm correctly places a node into the state "alloc#"
and calls my resume program. My resume program successfully provisions the
cloud node and slurmd comes up without a problem.
My resume program then retrieves the IP address of my cloud node and
updates the controller as follows:
scontrol update nodename=foo nodeaddr=bar
And then nothing happens! The node remains in the state "alloc#" until the
ResumeTimeout is reached, at which point the controller gives up.
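
For reference, a minimal sketch of the update step described above, not the actual resume program. It assumes Slurm passes the nodes to resume as a single hostlist expression in the first argument, and it expands that expression with "scontrol show hostnames". The names expand_hostlist, build_update_cmd, lookup_ip and resume are hypothetical, and lookup_ip is only a placeholder for whatever provider-specific call returns the node's address.

```python
import subprocess


def expand_hostlist(hostlist):
    """Expand a hostlist expression such as "node[1-3]" into individual names."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", hostlist],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.split()


def build_update_cmd(node_name, node_addr):
    """Build the scontrol call that records the node's address on the controller."""
    return ["scontrol", "update", f"nodename={node_name}", f"nodeaddr={node_addr}"]


def lookup_ip(node_name):
    """Hypothetical placeholder: ask the cloud provider for the node's IP address."""
    raise NotImplementedError("replace with a provider-specific lookup")


def resume(hostlist, lookup=lookup_ip):
    """Update the controller with the address of every node in the hostlist."""
    for node in expand_hostlist(hostlist):
        # check=True makes a failed scontrol call raise instead of passing silently.
        subprocess.run(build_update_cmd(node, lookup(node)), check=True)
```

With these names, build_update_cmd("foo", "bar") produces the same arguments as the command shown above, and resume(sys.argv[1]) would be the entry point when the script is set as ResumeProgram.
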
I'm fairly confident that slurmd is able to talk to the controller because,
if I specify an incorrect hostname for the controller in my slurm.conf,
slurmd immediately errors on startup and exits with a message saying
something like "unable to contact controller".
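
The startup check above only exercises the node-to-controller direction. A rough sketch of how plain TCP reachability could be tested in either direction follows. The helper name can_connect is hypothetical, and the ports mentioned below are the Slurm defaults (SlurmctldPort 6817, SlurmdPort 6818), which is an assumption about this cluster's slurm.conf.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Run on the cloud node, can_connect("<controller hostname>", 6817) tests the path to slurmctld. Run on the controller, can_connect("bar", 6818) tests the path to slurmd at the address given as nodeaddr. A successful connection only shows that the port is reachable, not that Slurm accepts the traffic.
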
What am I missing?
Thanks very much in advance if anybody has any ideas!