"Kevin M. Hildebrand via slurm-users" <slurm-users@lists.schedmd.com> writes:
The former jobs should be running but are not. We currently have these backfill parameters set: bf_continue,bf_max_job_user=10. bf_max_job_test is at its default of 500. However, sdiag reports that bf_max_job_test has been hit zero times, so that is probably not relevant. I can try removing bf_max_job_user, but I don't think that's the issue either, as this problem also seems to affect users with few jobs in the queue when a different user has consumed all GPUs of one type.
Perhaps you can add more debugging in slurmctld, for instance DebugFlags=Backfill,SelectType (and possibly Gres) and increase SlurmctldDebug to debug2 or debug3. Then you might see *why* it doesn't schedule the jobs.
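As a rough sketch of what that could look like in slurm.conf (assuming you have no other DebugFlags set already; if you do, append these to your existing list rather than replacing it):

```
# slurm.conf fragment (a sketch, merge with your existing settings)
# Log backfill scheduler decisions, node selection, and GRES handling.
DebugFlags=Backfill,SelectType,Gres
# Raise slurmctld verbosity; debug3 is noisier still.
SlurmctldDebug=debug2
```

After editing, "scontrol reconfigure" should pick up the change. If you would rather not touch the config file, the same can usually be toggled at runtime with "scontrol setdebugflags +Backfill" (repeated for each flag) and "scontrol setdebug debug2". Either way, the slurmctld log grows quickly at these levels, so it is worth reverting once you have captured a few backfill cycles.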