<div dir="ltr">
<div>I just stumbled on an issue that occasionally occurs when Slurm
creates instances using the power/cloud plugin. Here is what
happens:</div><div><br></div><div>I'm
using the Slurm power/cloud plugin to create compute instances on
demand. Occasionally I run into the following situation
when new instances are created and nodes are started:<br></div><div><br></div><div>
The slurmd on the new instance correctly connects to the slurmctld
without any error. The job in the queue assigned to the created node
(which triggered the node start) changes from the "configuring (CF)"
into the "running (R)" state. In the slurmd logfile on the created node,
I can find an entry saying that the job is started. However, slurmd
never creates any child process for the job and the node stays idle.
Slurm never changes the state of this job (I waited a few hours).
It's always listed as "running" but on the node there's nothing started
(there's just an idle slurmd process running). This only happens for
jobs which triggered a node start/instance creation. Once the instance
is up and running, the problem is gone: jobs submitted afterwards are not affected. Has anyone seen this before? I'll
investigate further and post additional information if I find out more.</div>
</div>