[slurm-users] Jobs blocking scheduling progress

Yair Yarom irush at cs.huji.ac.il
Wed Jul 4 02:08:41 MDT 2018


Hi,

As Paul mentioned, we once encountered a starvation issue with the
backfill algorithm and have since set bf_window to match the maximum
running time across all of our partitions. That could be what is
happening here.
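
As a rough sketch (the values here are illustrative only; bf_window is in
minutes and should cover the longest time limit you allow, e.g. 14 days =
20160), the relevant slurm.conf knobs look something like:

    SchedulerType=sched/backfill
    SchedulerParameters=bf_window=20160,bf_max_job_test=1000,bf_continue

You can see how deep the backfill cycle actually gets in the "Backfilling
stats" section of sdiag's output.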

Also make sure the jobs really can run on the non-GPU nodes (we
regularly see users blame the queuing system when they have actually
requested so much memory/CPU/time that their jobs can only run on a
couple of busy nodes).
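
A quick way to check is to compare what a pending job actually requested
with what the idle non-GPU nodes provide (the job id below is just a
placeholder):

    scontrol show job <jobid>      # NumCPUs, memory, TimeLimit, Features
    sinfo -N -o "%N %c %m %f %G"   # per-node CPUs, memory, features, gres

The "Reason" column of squeue for the pending jobs is also worth a look.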

As an alternative, I can suggest using the nodes' "Weight" option
instead of, or in addition to, your QOS. Keeping the non-GPU nodes at a
lower Weight (Slurm prefers lower-weight nodes) will cause CPU-only jobs
to try to run there first, and only resort to the GPU nodes when they
can't.
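
Something along these lines in slurm.conf (node names and counts are made
up; note that Slurm fills the lowest-Weight nodes first):

    NodeName=cpu[01-15] CPUs=28 Weight=1
    NodeName=gpu[01-04] CPUs=28 Gres=gpu:tesla:4 Weight=100

With that, CPU-only jobs land on the cpu nodes while any are free, and only
spill over onto the gpu nodes afterwards.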

    Yair.



On Wed, Jul 4, 2018 at 4:30 AM, Paul Edmon <pedmon at cfa.harvard.edu> wrote:
> Odds are the backfill loop is not penetrating far enough into the queue.
> Recall that Slurm has two scheduling loops.  The primary loop is the faster
> one, but it only goes as deep as it can schedule; in this case it would stop
> immediately at the GPU jobs it can't place.  That leaves it up to the
> backfill loop to fill in the gaps, so I would make sure your backfill loop
> is actually doing that.  Either the backfill loop isn't going deep enough
> into the queue to pick up the CPU jobs that could run, or it has decided
> that those jobs can't run due to some vagary in its logic (typically because
> it thinks they won't fit due to time constraints).
>
> Anyway, that's where I would start.
>
> -Paul Edmon-
>
>
>
> On 7/3/2018 5:22 PM, Christopher Benjamin Coffey wrote:
>>
>> Hello!
>>
>> We are having an issue with high-priority GPU jobs blocking low-priority
>> CPU-only jobs.
>>
>> Our cluster is set up with a single partition, "all", and every node
>> resides in it. The "all" partition holds four generations of compute
>> nodes, including the GPU nodes; we do this to make use of the otherwise
>> idle cores on the GPU nodes for compute-only jobs. Users select a
>> particular generation, or the GPU nodes, by specifying a constraint (if
>> they care), plus --qos=gpu / --gres=gpu:tesla:1 for the GPU nodes. The gpu
>> QOS gives those jobs the highest priority in the queue so they are
>> scheduled sooner onto the limited GPU resources we have. This has worked
>> out really nicely for quite some time, but lately we've noticed that the
>> GPU jobs are blocking the CPU-only jobs. Yes, the GPU jobs have higher
>> priority, yet they can only run on a very small subset of nodes compared
>> to the CPU-only jobs. It appears that Slurm isn't taking the limited set
>> of nodes the GPU jobs can run on into consideration; that's the only
>> explanation I can see for the GPU jobs blocking the CPU-only jobs. I'm not
>> sure whether this is due to a recent Slurm change or we just never noticed
>> before, but it's definitely happening.
>>
>> For example, the behavior shows up in the following scenario:
>>
>> - 15 compute nodes (no gpus) are idle
>> - All of the gpus are occupied
>> - 1000's of low priority compute only jobs in the pending queue
>> - 100's of highest priority gpu jobs in the pending queue
>>
>> In this scenario, the low-priority jobs are neither backfilled nor
>> started, even though compute-only nodes sit idle. If I hold the GPU jobs,
>> the lower-priority compute-only jobs are then started.
>>
>> Has anyone seen this? Am I thinking about this wrong? I would think that
>> Slurm should not be considering the nodes with no GPUs when trying to
>> fulfill the GPU jobs.
>>
>> I have an idea of how to fix this, though I think our current config
>> should already work. The fix I'm mulling over is to create a gpu
>> partition, place the GPU nodes in it, and then use the all_partitions job
>> submit plugin to schedule compute-only jobs into both partitions. The GPU
>> jobs would then only land in the gpu partition. I'd think that would
>> definitely fix the issue, but maybe there is a downside. Still, I think
>> the way we have it now should be working!?
>>
>> Thanks for your advice!
>>
>> Best,
>> Chris
>>
>> Christopher Coffey
>> High-Performance Computing
>> Northern Arizona University
>> 928-523-1167
>>


