[slurm-users] Compact scheduling strategy for small GPU jobs

Jack Chen scsvip at gmail.com
Thu Aug 12 16:06:17 UTC 2021


Cool, node weights are useful. I will also split this big partition into two
partitions: one for small jobs and one for 8-GPU jobs. That should help too.
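
To illustrate why weights should help with the case from my first mail, here
is a toy Python model. It is only a sketch and not Slurm's real scheduler: the
Node class, the place() function and the weight values are made up for
illustration, and the only assumption is that the lowest-weight node with
enough free GPUs is tried first.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    weight: int        # lower weight = tried first (assumed model of node weights)
    free_gpus: int = 8


def place(nodes, gpus):
    """Put a job on the lowest-weight node that still has enough free GPUs."""
    for node in sorted(nodes, key=lambda n: n.weight):
        if node.free_gpus >= gpus:
            node.free_gpus -= gpus
            return node.name
    return None  # no single node can hold the job, so it would stay pending


# Fragmented state from the thread: node A has 2 idle GPUs, node B has 6.
fragmented = [Node("A", weight=1, free_gpus=2), Node("B", weight=2, free_gpus=6)]
fragmented_result = place(fragmented, 8)

# Same eight 1-GPU jobs, but node A is always preferred while it has room.
nodes = [Node("A", weight=1), Node("B", weight=2)]
small_jobs = [place(nodes, 1) for _ in range(8)]
big_job = place(nodes, 8)

print(fragmented_result)  # None: the 8-GPU job cannot be placed
print(small_jobs)         # all eight small jobs land on node A
print(big_job)            # node B is still empty, so the 8-GPU job fits there
```

In this model the eight small jobs fill node A completely, which leaves node B
whole for the 8-GPU job. Whether the real scheduler behaves the same way on my
cluster is something I still have to test.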

On Wed, Aug 11, 2021 at 3:57 AM Brian Andrus <toomuchit at gmail.com> wrote:

> You may also want to look at node weights. By setting them at different
> levels for each node, you can give a preference to one over the other.
>
> That may be a way to do a "try this node first" method of job placement.
>
> Brian Andrus
> On 8/10/2021 9:19 AM, Jack Chen wrote:
>
> Thanks for your reply! Slurm certainly will not place small jobs on the
> same node if resources are not available. But I'm using default values in
> my test, and the job command is:
> srun -n 1 --cpus-per-task=2 --gres=gpu:1 'sleep 12000'
>
> When I submit another eight 1-GPU jobs, they run on both node A and node B,
> so I believe we can rule out a resource shortage.
>
> Slurm versions >= 17 support the gpus parameters, which help jobs run when
> resources are fragmented. But it would be a great help if Slurm supported a
> compact scheduling strategy that packs these small GPU jobs onto one node,
> so that fragmentation does not occur in the first place.
>
> Later I will set up the newest Slurm version and rerun the test case above.
> There are thousands of machines in my cluster and users want to submit
> hundreds of small jobs, so fragmentation is really annoying.
>
> PS: I sent the reply above only to Diego and forgot to reply all. (:
>
>
> On Tue, Aug 10, 2021 at 11:44 PM Brian Andrus <toomuchit at gmail.com> wrote:
>
>> You may want to look at your resources. If the memory allocation adds up
>> such that there isn't enough left for any job to run, it won't matter that
>> there are still GPUs available.
>>
>> The same applies to any other resource (CPUs, cores, etc.).
>>
>> Brian Andrus
>>
>>
>> On 8/10/2021 8:07 AM, Jack Chen wrote:
>>
>> Does anyone have any ideas on this?
>>
>> On Fri, Aug 6, 2021 at 2:52 PM Jack Chen <scsvip at gmail.com> wrote:
>>
>>> I'm using Slurm 15.08.11. When I submit several 1-GPU jobs, Slurm doesn't
>>> allocate nodes using a compact strategy. Does anyone know how to solve
>>> this? Will upgrading to the latest Slurm version help?
>>>
>>> For example, there are two nodes, A and B, with 8 GPUs each. When I
>>> submit eight 1-GPU jobs, Slurm places the first 6 jobs on node A and the
>>> last 2 jobs on node B. If I then submit one job that needs 8 GPUs, it
>>> stays pending because the free GPUs are fragmented: node A has 2 idle
>>> GPUs and node B has 6.
>>>
>>> Thanks in advance!
>>>
>>