[slurm-users] Compact scheduling strategy for small GPU jobs
Brian Andrus
toomuchit at gmail.com
Tue Aug 10 19:55:23 UTC 2021
You may also want to look at node weights. By setting them at different
levels for each node, you can give a preference to one over the other.
That may be a way to do a "try this node first" method of job placement.
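For instance, a sketch of what that might look like in slurm.conf (the node names, CPU/GPU counts, and weight values here are made-up placeholders):

```
# Lower Weight means higher preference: Slurm allocates the
# lowest-weight eligible nodes first, so jobs fill nodeA before nodeB.
NodeName=nodeA CPUs=32 Gres=gpu:8 Weight=10
NodeName=nodeB CPUs=32 Gres=gpu:8 Weight=20
```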
Brian Andrus
On 8/10/2021 9:19 AM, Jack Chen wrote:
> Thanks for your reply! Slurm certainly won't place small jobs on the
> same node if the resources aren't available there. But I'm using the
> default values in my case; the job command is: srun -n 1
> --cpus-per-task=2 --gres=gpu:1 sleep 12000
>
> When I submit another eight one-GPU jobs, they run on both node A and
> node B, so I believe we can rule out a resource shortage.
>
> Slurm versions >= 17 support GPU parameters that help jobs run when
> resource fragmentation occurs, but it would be a great help if Slurm
> supported a compact scheduling strategy that runs these small GPU
> jobs on one node to avoid fragmentation in the first place.
>
> Later I will set up the newest Slurm version and test the case above.
> There are thousands of machines in my cluster and users submit
> hundreds of small jobs, so fragmentation is a real problem.
>
> PS: I replied to Diego above but forgot to reply-all. (:
>
>
> On Tue, Aug 10, 2021 at 11:44 PM Brian Andrus <toomuchit at gmail.com
> <mailto:toomuchit at gmail.com>> wrote:
>
> You may want to look at your resources. If the memory allocations
> add up such that there isn't enough left for any job to run, it
> won't matter that there are still GPUs available.
>
> The same goes for any other resource (CPUs, cores, etc.).
>
> Brian Andrus
>
>
> On 8/10/2021 8:07 AM, Jack Chen wrote:
>> Does anyone have any ideas on this?
>>
>> On Fri, Aug 6, 2021 at 2:52 PM Jack Chen <scsvip at gmail.com
>> <mailto:scsvip at gmail.com>> wrote:
>>
>> I'm using Slurm 15.08.11, and when I submit several one-GPU jobs,
>> Slurm doesn't allocate nodes with a compact strategy. Does anyone
>> know how to solve this? Would upgrading to the latest Slurm
>> version help?
>>
>> For example, take two nodes A and B with 8 GPUs per node. I
>> submitted eight one-GPU jobs; Slurm allocated the first six jobs
>> to node A and the last two to node B. When I then submit one job
>> that needs 8 GPUs, it stays pending because of GPU fragmentation:
>> node A has 2 idle GPUs and node B has 6.
>>
>> Thanks in advance!
>>