[slurm-users] Jobs blocking scheduling progress

Christopher Benjamin Coffey Chris.Coffey at nau.edu
Tue Jul 3 15:22:06 MDT 2018


Hello!

We are having an issue with high-priority GPU jobs blocking low-priority CPU-only jobs.

Our cluster is set up with a single partition, "all", and every node resides in that partition. The "all" partition contains four generations of compute nodes, including the GPU nodes; we do this so that the unused cores on the GPU nodes can be used for compute-only jobs. Users select a particular generation with a constraint (if they care), and GPU jobs additionally specify --qos=gpu and --gres=gpu:tesla:1. The gpu QOS gives those jobs the highest priority in the queue, so they get scheduled sooner onto the GPUs, which are our most limited resource. This has worked out really well for quite some time. Lately, however, we've noticed that the GPU jobs are blocking the CPU-only jobs. Yes, the GPU jobs have higher priority, but they can only run on a very small subset of nodes compared to the CPU-only jobs. It appears that Slurm isn't taking into consideration the limited set of nodes a GPU job can run on; that's the only explanation I can see for the GPU jobs blocking the CPU-only jobs. I'm not sure whether this is due to a recent Slurm change or whether we simply never noticed before, but it's definitely happening.
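For reference, the relevant pieces of that setup look roughly like this (the QOS priority value, constraint name, and script name are illustrative, not our exact config):

    # QOS that boosts GPU job priority (assumes PriorityWeightQOS is nonzero
    # in slurm.conf so the QOS priority actually factors in)
    sacctmgr add qos gpu
    sacctmgr modify qos where name=gpu set priority=10000

    # Typical GPU job submission into the single "all" partition
    sbatch --qos=gpu --gres=gpu:tesla:1 --constraint=broadwell job.sh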

For example, the behavior occurs in the following scenario:

- 15 compute nodes (no GPUs) are idle
- All of the GPUs are occupied
- Thousands of low-priority, CPU-only jobs are in the pending queue
- Hundreds of highest-priority GPU jobs are in the pending queue

In this scenario, the low-priority jobs are neither backfilled nor started, even though compute-only nodes remain idle. If I hold the GPU jobs, the lower-priority CPU-only jobs then start.
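Roughly how I observe and work around this (the output format and the hold are just what I use when testing; job IDs come from squeue):

    # Pending CPU-only jobs sit with Reason=Priority or Resources even though
    # compute-only nodes are idle
    squeue -t PENDING -o "%.12i %.10Q %.8q %.20R" | head

    # Holding the pending gpu-QOS jobs lets the CPU-only jobs start
    squeue -h -t PENDING -q gpu -o "%i" | xargs -n1 scontrol hold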

Has anyone seen this? Am I thinking about this wrong? I would think that Slurm should not be considering nodes with no GPUs when trying to satisfy the GPU jobs.

I have an idea of how to fix this scenario, although I think our current config should already work. The fix I am mulling over is to create a gpu partition and place the GPU nodes in it, then use the all_partitions job submit plugin so that compute-only jobs are scheduled into both partitions; the GPU jobs would then land only in the gpu partition, as sketched below. I'd think that would definitely fix the issue, but maybe there is a downside. Still, I think the way we have it now should be working!?
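A rough sketch of what that would look like in slurm.conf (node names and ranges are placeholders, not our actual node list):

    # GPU nodes move into their own partition; compute-only nodes stay in "all"
    PartitionName=all Nodes=cn[001-100] Default=YES State=UP
    PartitionName=gpu Nodes=gpu[01-08] Default=NO State=UP

    # Jobs submitted without an explicit partition are assigned every partition,
    # so compute-only jobs could still use idle cores on the GPU nodes
    JobSubmitPlugins=all_partitions

The GPU jobs would then presumably request the gpu partition explicitly, alongside the existing --qos=gpu / --gres options.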

Thanks for your advice!

Best,
Chris 

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 


