[slurm-users] Cloud nodes remain in state "alloc#"

Sat Oct 24 22:01:03 UTC 2020

Hi,

I'm using Slurm's elastic compute functionality to spin up nodes in the
cloud, alongside a controller which is also in the cloud.

When executing a job, Slurm correctly places a node into the state "alloc#"
and calls my resume program. My resume program successfully provisions the
cloud node and slurmd comes up without a problem.

My resume program then retrieves the IP address of my cloud node and
updates the controller as follows:

scontrol update nodename=foo nodeaddr=bar

And then nothing happens! The node remains in the state "alloc#" until the
ResumeTimeout is reached, at which point the controller gives up.
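
For reference, the resume flow described above can be sketched roughly as
follows. This is a minimal sketch, not the exact script in use: it assumes
Slurm invokes the resume program with a single hostlist expression (for
example "foo[1-3]") as its argument, and provision_node is a hypothetical
placeholder for whatever cloud-provider calls create the instance and print
its IP address.

```shell
#!/bin/bash
# Sketch of a ResumeProgram. Assumption: Slurm calls it with one argument,
# a hostlist expression such as "foo[1-3]".

# HYPOTHETICAL placeholder: replace with your cloud provider's calls.
# It must print the IP address of the newly provisioned node on stdout.
provision_node() {
    echo "provision_node is not implemented for $1" >&2
    return 1
}

resume_nodes() {
    local hostlist="$1" host ip rc=0
    # Expand the hostlist expression into one hostname per line.
    for host in $(scontrol show hostnames "$hostlist"); do
        if ! ip="$(provision_node "$host")" || [ -z "$ip" ]; then
            echo "resume: could not provision $host" >&2
            rc=1
            continue
        fi
        # Same update as in the post, applied to each node in turn.
        if ! scontrol update nodename="$host" nodeaddr="$ip"; then
            echo "resume: scontrol update failed for $host" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Run only when called with a hostlist argument, as Slurm would do.
if [ "$#" -gt 0 ]; then
    resume_nodes "$1"
fi
```

Logging each failure to stderr and returning non-zero makes it easier to tell
a provisioning problem apart from a failed scontrol update when a node stays
in "alloc#".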

I'm fairly confident that slurmd is able to talk to the controller because
if I specify an incorrect hostname for the controller in my slurm.conf,
slurmd immediately errors on startup and exits with a message saying
something like "unable to contact controller".

What am I missing?

Thanks very much in advance if anybody has any ideas!