[slurm-users] Power/Cloud Plugin - Race Condition after Node Start - Wrong Job State

Felix Wolfheimer f.wolfheimer at googlemail.com
Mon Sep 2 11:28:46 UTC 2019


I just stumbled on an issue which occurs occasionally when Slurm creates
compute instances on demand using the power/cloud plugin. When new instances
are created and nodes are started, I sometimes run into the following
situation:

The slurmd on the new instance connects to the slurmctld without any error.
The job in the queue assigned to the created node (the job which triggered
the node start) changes from the "configuring (CF)" state to the
"running (R)" state. In the slurmd logfile on the created node, I can find
an entry saying that the job was started. However, slurmd never creates any
child process for the job and the node stays idle. Slurm never changes the
state of this job (I waited a few hours): it is always listed as "running",
but nothing is started on the node (there is just an idle slurmd process
running).

This only happens for jobs which triggered a node start/instance creation.
Once the instance is up and running, jobs submitted afterwards don't have
this issue. Has anyone seen this? I'll investigate a bit more and post
additional information if I can find out more.
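
For anyone trying to reproduce this, here is a minimal sketch of how the
symptom could be detected on a compute node: it compares the jobs that squeue
reports as running on the node with the jobs for which "scontrol listpids"
shows at least one process. The function names, the assumed column layout of
the listpids output (PID first, JOBID second), and the restriction to plain
(non-array) job IDs are assumptions made for illustration, not something
taken from the setup described above.

```python
import subprocess


def parse_job_ids(squeue_output):
    """Return job IDs from `squeue -h -o %i` output (one ID per line)."""
    return [line.strip() for line in squeue_output.splitlines() if line.strip()]


def jobs_with_pids(listpids_output):
    """Return the set of job IDs that own at least one PID.

    Assumes the `scontrol listpids` layout with PID in the first column
    and JOBID in the second; the header line is skipped because its first
    field is not numeric.
    """
    job_ids = set()
    for line in listpids_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit():
            job_ids.add(fields[1])
    return job_ids


def find_stuck_jobs(squeue_output, listpids_output):
    """Jobs listed as running on the node but without any process on it."""
    active = jobs_with_pids(listpids_output)
    return [job for job in parse_job_ids(squeue_output) if job not in active]


def check_node(node):
    """Run on the compute node itself; `node` is its Slurm node name."""
    squeue_out = subprocess.run(
        ["squeue", "-h", "-t", "R", "-w", node, "-o", "%i"],
        capture_output=True, text=True, check=True,
    ).stdout
    # listpids only reports processes on the local node. A non-zero exit
    # status is not treated as fatal here (assumption: it may fail when
    # nothing is running on the node).
    listpids_out = subprocess.run(
        ["scontrol", "listpids"],
        capture_output=True, text=True, check=False,
    ).stdout
    return find_stuck_jobs(squeue_out, listpids_out)
```

Calling check_node("<node name>") on an affected instance should then return
the IDs of jobs that are shown as "running" although no process was started
for them. A job that has only just started may briefly show up as well, so
the check is best repeated after a short delay.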

