Hi All,
I'm managing a four-node cluster running Slurm, and one of the compute nodes appears to be having issues. 'squeue' on the head node reports jobs as running, but when I connect to the problem node I see no active processes and the GPUs are not being used.
[sushil@ccbrc ~]$ sinfo -Nel
Wed May 29 12:00:08 2024
NODELIST  NODES PARTITION STATE        CPUS  S:C:T   MEMORY  TMP_DISK WEIGHT AVAIL_FE REASON
gag           1 defq*     mixed          48  2:24:1  370000         0      1 (null)   none
gag           1 glycore   mixed          48  2:24:1  370000         0      1 (null)   none
glyco1        1 defq*     completing*   128  2:64:1  500000         0      1 (null)   none
glyco1        1 glycore   completing*   128  2:64:1  500000         0      1 (null)   none
glyco2        1 defq*     mixed         128  2:64:1  500000         0      1 (null)   none
glyco2        1 glycore   mixed         128  2:64:1  500000         0      1 (null)   none
mannose       1 defq*     mixed          24  2:12:1  180000         0      1 (null)   none
mannose       1 glycore   mixed          24  2:12:1  180000         0      1 (null)   none
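For what it's worth, from the head node I can also cross-check which jobs the controller believes are on glyco1 (the format string below is just one convenient choice):

squeue -w glyco1 -o "%i %T %M %N"   # job ID, state, elapsed time, and node list as slurmctld sees them
scontrol show node glyco1           # node state, Reason field, and allocated resources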
On glyco1 (the affected node):

squeue                          # gets stuck
sudo systemctl restart slurmd   # gets stuck
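A rough sketch of what can be checked on glyco1 itself when slurmd hangs like this (the log path below is only the usual default; SlurmdLogFile in slurm.conf gives the real location):

systemctl status slurmd                                    # is the old slurmd still active, or stuck deactivating?
ps -eo pid,stat,wchan:30,args | grep -E 'slurm(d|stepd)'   # slurmstepd processes stuck in D state usually point at hung I/O
tail -n 50 /var/log/slurm/slurmd.log                       # assumed default path; check SlurmdLogFile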
I tried the following to clear the jobs stuck in the CG (completing) state, but any new job then sits in the 'running' state without actually running:

scontrol update nodename=glyco1 state=down reason=cg
scontrol update nodename=glyco1 state=resume reason=cg
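For reference, the jobs stuck in completing can be listed, and in principle signalled directly, with something like this (<jobid> is a placeholder for whatever squeue reports):

squeue -t CG -w glyco1        # list jobs in the completing state on glyco1
scancel --signal=KILL <jobid> # placeholder job ID; sends SIGKILL to the job's remaining processes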
There is no I/O issue on that node, and all file systems are under 30% used. Any advice on how to resolve this without rebooting the machine?
Best, Sushil
One of the other states — down or fail, from memory — should cause it to completely drop the job.
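Roughly like this, from memory, so double-check the scontrol man page (the reason strings are arbitrary):

scontrol update nodename=glyco1 state=down reason="drop stuck CG jobs"
# or, if down alone doesn't clear them:
scontrol update nodename=glyco1 state=fail reason="drop stuck CG jobs"
# then bring the node back once slurmd responds again:
scontrol update nodename=glyco1 state=resume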
--
#BlackLivesMatter
Ryan Novosielski - novosirj@rutgers.edu
Sr. Technologist - 973/972.0922 (2x0922)
Office of Advanced Research Computing - MSB A555B, Newark
Rutgers, The State University of New Jersey - RBHS Campus