[slurm-users] Jobs blocking scheduling progress

Paul Edmon pedmon at cfa.harvard.edu
Tue Jul 3 19:30:49 MDT 2018


Odds are the backfill loop is not penetrating far enough into the 
queue.  Recall that Slurm has two scheduling loops.  The primary loop is 
the faster one, but it only works down the queue as far as it can 
schedule; in this case it would stop immediately at the GPU jobs it 
can't schedule.  That leaves it to the backfill loop to fill in the 
gaps, so I would make sure your backfill loop is actually doing that. 
Either the backfill loop isn't going deep enough into the queue to reach 
the CPU jobs that could run, or it has decided those jobs can't run due 
to some vagary in its logic (typically because it thinks they won't fit 
within the time constraints).
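If you want to check, sdiag reports backfill statistics (depth reached, queue 
length, cycle times), and the backfill behavior is controlled by 
SchedulerParameters in slurm.conf.  A rough sketch of what I'd look at; the 
values below are only illustrative, not recommendations for your site:

    # Scheduler statistics, including how deep the last backfill cycle got
    sdiag | grep -i -A 20 backfill

    # Current scheduler settings
    scontrol show config | grep -i SchedulerParameters

    # slurm.conf: let backfill test more jobs and look further ahead in time
    SchedulerParameters=bf_continue,bf_max_job_test=1000,bf_window=2880,bf_interval=30

After editing SchedulerParameters, "scontrol reconfigure" should pick up the change.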

Anyway, that's where I would start.

-Paul Edmon-


On 7/3/2018 5:22 PM, Christopher Benjamin Coffey wrote:
> Hello!
>
> We are having an issue with high-priority GPU jobs blocking low-priority CPU-only jobs.
>
> Our cluster is set up with one partition, "all", which contains every node, including four generations of compute nodes and the GPU nodes. We do this to make use of the unused cores on the GPU nodes for compute-only jobs. Users select a particular generation with a constraint (if they care), and select GPU nodes with --qos=gpu and --gres=gpu:tesla:1, roughly as sketched below. The gpu QOS gives those jobs the highest priority in the queue, so they get scheduled sooner onto our limited GPU resources. This has worked out nicely for quite some time.
>
> Lately, however, we've noticed that the GPU jobs are blocking the CPU-only jobs. Yes, the GPU jobs have higher priority, but they can only run on a very small subset of nodes compared to the CPU-only jobs. It appears that Slurm isn't taking into account the limited set of nodes the GPU jobs can run on; that's the only explanation I see for the GPU jobs blocking the CPU-only jobs. I'm not sure whether this is due to a recent Slurm change or we just never noticed it, but it's definitely happening.
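> To give a concrete picture, submissions look roughly like this (the script and feature names are just placeholders):
>
>     # GPU job: gpu QOS for top priority, plus a GPU gres request
>     sbatch --qos=gpu --gres=gpu:tesla:1 gpu_job.sh
>
>     # CPU-only job: optionally pinned to a hardware generation via a feature
>     sbatch --constraint=broadwell cpu_job.sh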
>
> For example, the behavior occurs in the following scenario:
>
> - 15 compute nodes (no GPUs) are idle
> - All of the GPUs are occupied
> - Thousands of low-priority compute-only jobs in the pending queue
> - Hundreds of highest-priority GPU jobs in the pending queue
>
> In this scenario, the low-priority compute-only jobs are neither backfilled nor started, even though compute-only nodes remain idle. If I hold the GPU jobs, the lower-priority compute-only jobs start.
>
> Has anyone seen this? Am I thinking about this wrong? I would think that Slurm should not be considering nodes without GPUs when trying to satisfy the GPU jobs.
>
> I have an idea for a fix, though I think our current config should already work. The fix I'm mulling over is to create a gpu partition and place the GPU nodes in it, then use the all_partitions job_submit plugin so that compute-only jobs are scheduled into both partitions while the GPU jobs land only in the gpu partition (sketched below). I'd think that would definitely fix the issue, but maybe there is a downside. Still, I think our current setup should be working!?
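> Roughly, the layout I'm imagining would look like this in slurm.conf (node names and counts are made up):
>
>     JobSubmitPlugins=all_partitions
>     # Compute-only nodes stay in "all"; GPU nodes get their own partition
>     PartitionName=all Nodes=cn[001-120] Default=YES State=UP
>     PartitionName=gpu Nodes=gpu[01-08] State=UP
>
> Compute-only jobs submitted without an explicit partition would then be spread across both partitions by the plugin, while the GPU jobs would only be able to land on the GPU nodes in the gpu partition.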
>
> Thanks for your advice!
>
> Best,
> Chris
>
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>   
>




