[slurm-users] Jobs stuck in "completing" (CG) state

Paul Edmon pedmon at cfa.harvard.edu
Sat Oct 24 17:11:51 UTC 2020


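This can happen if the underlying storage is wedged: Slurm cannot finish 
cleaning up a job while its processes are blocked on I/O.  I would check 
that the storage on the affected nodes is working properly.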

Usually the only way to clear this state is to either fix the stuck 
storage or reboot the node.
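
As a rough sketch of how one might check for this (not a definitive procedure), the helpers below pick out the jobs in the CG state from default-format squeue output and list processes in uninterruptible sleep (state D) from ps output, which is a common sign of hung storage. The function names cg_jobs and dstate_procs are made up for illustration, c-node3 is taken from the output quoted below, and /path/to/shared/mount is a placeholder for whatever filesystem the jobs actually use.

```shell
#!/bin/sh
# Hypothetical helpers; they only filter text on stdin, so they can be
# tried against saved output before touching a live node.

# Print "JOBID NODELIST" for every job whose state column (ST) is CG.
# Assumes default squeue columns and job names without spaces.
cg_jobs() {
    awk '$5 == "CG" { print $1, $NF }'
}

# Print "PID COMMAND" for processes in uninterruptible sleep (STAT starts
# with D). Expects the output of: ps -eo pid,stat,comm
dstate_procs() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $3 }'
}

# Possible use on a live cluster (assumes ssh access to the compute node):
#   squeue | cg_jobs
#   ssh c-node3 'ps -eo pid,stat,comm' | dstate_procs
#   ssh c-node3 'timeout -s KILL 10 stat /path/to/shared/mount'
# If the last command hangs or is killed by the timeout, the mount is
# probably the thing to fix.
```

Processes that stay in state D on a node holding a CG job point at the storage rather than at the Slurm configuration.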

-Paul Edmon-

On 10/24/2020 12:22 PM, Kimera Rodgers wrote:
> I'm setting up Slurm on an OpenHPC cluster with one master node and 5 
> compute nodes.
> When I run test jobs, the jobs get stuck in the CG state.
>
> Can someone give me a hint as to where I might have gone wrong?
>
> [root@kla-ac-ohpc-01 critical]# srun -c 8 --pty bash -i
> srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
> srun: error: Task launch for 37.0 failed on node c-node3: Socket timed out on send/recv operation
> srun: error: Application launch failed: Socket timed out on send/recv operation
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>
> [root@kla-ac-ohpc-01 critical]# squeue
>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>                 36    normal     bash     test CG       0:53      2 c-node[1-2]
>                 37    normal     bash     root CG       0:52      1 c-node3
>
> Thank you.
>
> Regards,
> Rodgers


