Hi All,
I'm managing a four-node cluster running Slurm, and one of the compute nodes appears to be having issues. 'squeue' on the head node reports jobs as running, but when I connect to the problem node I see no active processes and the GPUs are not being used.
[sushil@ccbrc ~]$ sinfo -Nel
Wed May 29 12:00:08 2024
NODELIST  NODES PARTITION STATE        CPUS  S:C:T   MEMORY  TMP_DISK WEIGHT AVAIL_FE REASON
gag           1 defq*     mixed          48  2:24:1  370000         0      1 (null)   none
gag           1 glycore   mixed          48  2:24:1  370000         0      1 (null)   none
glyco1        1 defq*     completing*   128  2:64:1  500000         0      1 (null)   none
glyco1        1 glycore   completing*   128  2:64:1  500000         0      1 (null)   none
glyco2        1 defq*     mixed         128  2:64:1  500000         0      1 (null)   none
glyco2        1 glycore   mixed         128  2:64:1  500000         0      1 (null)   none
mannose       1 defq*     mixed          24  2:12:1  180000         0      1 (null)   none
mannose       1 glycore   mixed          24  2:12:1  180000         0      1 (null)   none
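For what it's worth, from the head node I can also cross-check which jobs the controller believes are on glyco1 (the format string below is just one convenient choice):

squeue -w glyco1 -o "%i %T %M %N"   # job ID, state, elapsed time, and node list as slurmctld sees them
scontrol show node glyco1           # node state, Reason field, and allocated resources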
On glyco1 (the affected node):

squeue                          # gets stuck
sudo systemctl restart slurmd   # gets stuck
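A rough sketch of what can be checked on glyco1 itself when slurmd hangs like this (the log path below is only the usual default; SlurmdLogFile in slurm.conf gives the real location):

systemctl status slurmd                                    # is the old slurmd still active, or stuck deactivating?
ps -eo pid,stat,wchan:30,args | grep -E 'slurm(d|stepd)'   # slurmstepd processes stuck in D state usually point at hung I/O
tail -n 50 /var/log/slurm/slurmd.log                       # assumed default path; check SlurmdLogFile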
I tried the following to clear the jobs stuck in the CG (completing) state, but any new job then sits in the 'running' state without actually running:

scontrol update nodename=glyco1 state=down reason=cg
scontrol update nodename=glyco1 state=resume reason=cg
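For reference, the jobs stuck in completing can be listed, and in principle signalled directly, with something like this (<jobid> is a placeholder for whatever squeue reports):

squeue -t CG -w glyco1        # list jobs in the completing state on glyco1
scancel --signal=KILL <jobid> # placeholder job ID; sends SIGKILL to the job's remaining processes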
There is no I/O issue on that node, and all file systems are under 30% used. Any advice on how to resolve this without rebooting the machine?
Best, Sushil
One of the other states — down or fail, from memory — should cause it to completely drop the job.
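Roughly like this, from memory, so double-check the scontrol man page (the reason strings are arbitrary):

scontrol update nodename=glyco1 state=down reason="drop stuck CG jobs"
# or, if down alone doesn't clear them:
scontrol update nodename=glyco1 state=fail reason="drop stuck CG jobs"
# then bring the node back once slurmd responds again:
scontrol update nodename=glyco1 state=resume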
--
#BlackLivesMatter
Ryan Novosielski - novosirj@rutgers.edu
Sr. Technologist - 973/972.0922 (2x0922)
Office of Advanced Research Computing - MSB A555B, Newark
Rutgers, The State University of New Jersey - RBHS Campus