Dear all,
We've set up SLURM 24.05.3 on our cluster and are experiencing an issue with interactive jobs. We previously ran 21.08 with pretty much the same settings and did not see these problems. We started with a fresh database, etc.
The behavior of interactive jobs is very erratic. Sometimes they start absolutely fine; at other times they die silently in the background while the user waits indefinitely. We have been unable to narrow this down to particular users or nodes. On a given node, one user might be able to start an interactive job while another user cannot at the same time, and the next day the situation might be reversed.
The exception is jobs that use a reservation: as far as we can tell, these start fine every time. The number of idle nodes also does not seem to influence the behavior described above.
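For clarity, "using a reservation" here just means the standard salloc invocation with the reservation flag, for example (the reservation name below is a placeholder):

[user1@login1 ~]$ salloc --reservation=<name>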
Failed allocation on the front end:

[user1@login1 ~]$ salloc
salloc: Pending job allocation 5052052
salloc: job 5052052 queued and waiting for resources
The same job on the backend (slurmctld log, newest entry first):

2024-10-14 11:41:57.680 slurmctld: _job_complete: JobId=5052052 done
2024-10-14 11:41:57.678 slurmctld: _job_complete: JobId=5052052 WEXITSTATUS 1
2024-10-14 11:41:57.678 slurmctld: Killing interactive JobId=5052052: Communication connection failure
2024-10-14 11:41:46.666 slurmctld: sched/backfill: _start_job: Started JobId=5052052 in devel on m02n01
2024-10-14 11:41:30.096 slurmctld: sched: _slurm_rpc_allocate_resources JobId=5052052 NodeList=(null) usec=6258
Raising the debug level has not brought any additional information. We were hoping that one of you might be able to provide some insight into what the next steps in troubleshooting might be.
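For completeness, this is roughly how we raised the logging on the controller; the exact level and DebugFlags we tried varied, the lines below are just an example:

[root@login1 ~]# scontrol setdebug debug3
[root@login1 ~]# scontrol setdebugflags +Backfill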
Best regards,
Onno