[slurm-users] Jobs stuck in "completing" (CG) state
Kimera Rodgers
rkimerah at gmail.com
Sat Oct 24 16:22:14 UTC 2020
I'm setting up Slurm on an OpenHPC cluster with one master node and 5 compute
nodes.
When I run test jobs, they get stuck in the CG (completing) state.
Can someone give me a hint as to where I might have gone wrong?
[root@kla-ac-ohpc-01 critical]# srun -c 8 --pty bash -i
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: Task launch for 37.0 failed on node c-node3: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
[root@kla-ac-ohpc-01 critical]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
   36    normal bash test CG 0:53     2 c-node[1-2]
   37    normal bash root CG 0:52     1 c-node3
Thank you.
Regards,
Rodgers