Dear all,
we have set up SLURM 24.05.3 on our cluster and are experiencing an issue with interactive jobs. We previously ran 21.08 with pretty much the same settings and did not see these problems. The new installation was started with a fresh database etc.
The behavior of interactive jobs is very erratic. Sometimes they start absolutely fine; at other times they die silently in the background while the user waits indefinitely. We have been unable to pin the problem down to particular
users or nodes. On a given node, one user might be able to start an interactive job while another user cannot at the same time, and the day after, the situation might be reversed.
The exception is jobs that use a reservation: as far as we can tell, these start fine every time. The number of idle nodes also does not seem to influence the behavior described above.
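For reference, the reservation-backed allocations that work reliably are started along these lines (the reservation name here is just a placeholder for whatever reservation the user has access to):

[user1@login1 ~]$ salloc --reservation=some_reservation

Plain salloc without a reservation, as shown below, is where we see the erratic behavior.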
Failed allocation on the front end:
[user1@login1 ~]$ salloc
salloc: Pending job allocation 5052052
salloc: job 5052052 queued and waiting for resources