[slurm-users] Cloud nodes remain in state "alloc#"

Rupert Madden-Abbott rupert.madden.abbott at gmail.com
Sun Oct 25 16:36:44 UTC 2020


I have discovered that if I launch slurmd with the "-b" flag, my jobs run
successfully. However, this flag is documented as "Report node rebooted
when daemon restarted. Used for testing purposes." I'd rather not rely on
something intended only for testing, but perhaps this gives a clue as to
what is going wrong.
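
For reference, the only change when things work is the extra flag at slurmd
startup on the cloud node, i.e. roughly the following (any other options I
pass are unchanged and omitted here):

# start slurmd on the freshly provisioned cloud node; -b tells the
# controller the node has just (re)booted
slurmd -b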

Another possible clue is that, without the above flag, slurmctld waits for
ResumeTimeout seconds, reports that the node is down, and then immediately
reports that the node is responding and returned to service. The booted
slurmd node also receives an instruction to terminate the job that was never
actually allocated to it.

slurmctld: node foo not resumed by ResumeTimeout(60) - marking down and
power_save
slurmctld: Killing JobId=2 on failed node foo
slurmctld: Node foo now responding
slurmctld: node_did_resp: node foo returned to service
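
In case it's relevant, the cloud/power-saving parts of my slurm.conf look
roughly like the excerpt below (node names, paths and values are
illustrative placeholders rather than my exact configuration):

# slurm.conf (excerpt) - elastic/cloud node settings
ResumeProgram=/usr/local/bin/resume-node.sh
SuspendProgram=/usr/local/bin/suspend-node.sh
ResumeTimeout=60
SuspendTime=300
SlurmdTimeout=300
NodeName=foo State=CLOUD CPUs=2 RealMemory=2000
PartitionName=cloud Nodes=foo Default=YES MaxTime=INFINITE State=UP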

I've also tried:

* Increasing the resume and slurmd timeouts so they are very long (slurmd
comes up well within these limits). This has no impact.
* Swapping the order in which I boot slurmd and call scontrol update. This
has no impact.
* Setting the state to Resume via scontrol update. This gives me an "invalid
state transition" error from ALLOCATION to RESUME.
* Setting the hostname of the node via scontrol update, because the node's
hostname doesn't match its NodeName and I have added the NodeName as an
alias in /etc/hosts on the slurmd node (both commands are sketched below).
This has no impact.
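
For concreteness, the scontrol commands behind the last two bullets look
roughly like this (node and host names are placeholders):

# attempt to force the node back into service; slurmctld rejects this
# with the invalid state transition error mentioned above
scontrol update nodename=foo state=resume

# give the controller the node's real hostname in addition to its address
scontrol update nodename=foo nodehostname=foo-cloud-01 nodeaddr=bar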

On Sat, 24 Oct 2020 at 23:01, Rupert Madden-Abbott <
rupert.madden.abbott at gmail.com> wrote:

> Hi,
>
> I'm using Slurm's elastic compute functionality to spin up nodes in the
> cloud, alongside a controller which is also in the cloud.
>
> When executing a job, Slurm correctly places a node into the state
> "alloc#" and calls my resume program. My resume program successfully
> provisions the cloud node and slurmd comes up without a problem.
>
> My resume program then retrieves the IP address of my cloud node and
> updates the controller as follows:
>
> scontrol update nodename=foo nodeaddr=bar
>
> And then nothing happens! The node remains in the state "alloc#" until the
> ResumeTimeout is reached at which point the controller gives up.
>
> I'm fairly confident that slurmd is able to talk to the controller,
> because if I specify an incorrect hostname for the controller in my
> slurm.conf, then slurmd immediately errors on startup and exits with a
> message saying something like "unable to contact controller".
>
> What am I missing?
>
> Thanks very much in advance if anybody has any ideas!
>

