Hello Slurm community,
We are using Slurm to deploy training jobs on a large GPU cluster, but we have encountered some strange behavior. As newcomers, we wonder if this is a known issue. Some more details:
- We are running a relatively old version, 22.05.
- At relatively high load, we encounter hangs. It is particularly puzzling in the following sense: say we have nodelist1 with 6 hosts and nodelist2 with 7 hosts, and we run a simple `hostname`. Deploying on nodelist1 alone or on nodelist2 alone works fine, but with all 13 hosts, the debug messages show that execution hangs after reporting that the last task is done. It then hangs for exactly 180 seconds.
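For concreteness, the commands we run look roughly like the following (the node names and ranges are placeholders, not our actual hostnames):

```shell
# Either nodelist alone completes normally:
srun --nodelist=node[01-06] hostname   # nodelist1, 6 hosts
srun --nodelist=node[07-13] hostname   # nodelist2, 7 hosts

# Combining all 13 hosts hangs for ~180 s after the last task prints:
srun --nodelist=node[01-13] hostname
```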
Does anyone know what the issue might be? We would be happy to post more config details or debug messages.
Thank you so much!
Richard