[slurm-users] slurm-users Digest, Vol 70, Issue 3
Cumer Cristiano
CristianoMaria.Cumer at unibz.it
Thu Aug 3 16:52:50 UTC 2023
Hi Michael,
Indeed, I had the older scheduler loaded, not backfill. I have updated the configuration and will see whether the scheduler picks up the pending jobs.
Thanks
Cristiano
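
For anyone hitting the same thing, here is a minimal sketch of how one might check which scheduler plugin a slurm.conf selects. Assumptions: the helper name `scheduler_type` and the sample excerpt are hypothetical, the parsing is a simplification of Slurm's real config parser (one `Key=Value` per line, `#` comments, case-insensitive keys), and `sched/builtin` merely stands in for "the older scheduler", which the message above does not name. The fallback to `sched/backfill` reflects the documented default.

```python
def scheduler_type(conf_text, default="sched/backfill"):
    """Return the SchedulerType value found in slurm.conf-style text.

    Simplified parsing (an assumption, not Slurm's own parser):
    comments start with '#', keys are matched case-insensitively,
    and the last occurrence wins.
    """
    value = default
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip().lower() == "schedulertype":
            value = val.strip()
    return value


# Hypothetical excerpt, not the poster's real configuration.
sample = """
# scheduling
SchedulerType=sched/builtin
SelectType=select/cons_tres
"""

print(scheduler_type(sample))
```

On a live cluster the running value can also be read with `scontrol show config | grep SchedulerType`.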
________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of slurm-users-request at lists.schedmd.com <slurm-users-request at lists.schedmd.com>
Sent: Wednesday, August 2, 2023 4:15 PM
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: slurm-users Digest, Vol 70, Issue 3
Today's Topics:
1. Job in "priority" status - resources available (Cumer Cristiano)
2. Re: Job in "priority" status - resources available (Michael Gutteridge)
----------------------------------------------------------------------
Message: 1
Date: Wed, 2 Aug 2023 12:09:52 +0000
From: Cumer Cristiano <CristianoMaria.Cumer at unibz.it>
To: "slurm-users at lists.schedmd.com" <slurm-users at lists.schedmd.com>
Subject: [slurm-users] Job in "priority" status - resources available
Message-ID:
<PAVPR07MB91916B49909E972995CE0806E10BA at PAVPR07MB9191.eurprd07.prod.outlook.com>
Content-Type: text/plain; charset="iso-8859-1"
Hello,
I'm quite a newbie regarding Slurm. I recently created a small Slurm instance to manage our GPU resources. I have this situation:
JOBID  STATE    TIME        ACCOUNT   PARTITION  PRIORITY  REASON     CPU  MIN_MEM  TRES_PER_NODE
1739   PENDING  0:00        standard  gpu-low    5         Priority   1    80G      gres:gpu:a100_1g.10gb:1
1738   PENDING  0:00        standard  gpu-low    5         Priority   1    80G      gres:gpu:a100-sxm4-80gb:1
1737   PENDING  0:00        standard  gpu-low    5         Priority   1    80G      gres:gpu:a100-sxm4-80gb:1
1736   PENDING  0:00        standard  gpu-low    5         Resources  1    80G      gres:gpu:a100-sxm4-80gb:1
1740   PENDING  0:00        standard  gpu-low    1         Priority   1    8G       gres:gpu:a100_3g.39gb
1735   PENDING  0:00        standard  gpu-low    1         Priority   8    64G      gres:gpu:a100-sxm4-80gb:1
1596   RUNNING  1-13:26:45  standard  gpu-low    3         None       2    64G      gres:gpu:a100_1g.10gb:1
1653   RUNNING  21:09:52    standard  gpu-low    2         None       1    16G      gres:gpu:1
1734   RUNNING  59:52       standard  gpu-low    1         None       8    64G      gres:gpu:a100-sxm4-80gb:1
1733   RUNNING  1:01:54     standard  gpu-low    1         None       8    64G      gres:gpu:a100-sxm4-80gb:1
1732   RUNNING  1:02:39     standard  gpu-low    1         None       8    40G      gres:gpu:a100-sxm4-80gb:1
1731   RUNNING  1:08:28     standard  gpu-low    1         None       8    40G      gres:gpu:a100-sxm4-80gb:1
1718   RUNNING  10:16:40    standard  gpu-low    1         None       2    8G       gres:gpu:v100
1630   RUNNING  1-00:21:21  standard  gpu-low    1         None       1    30G      gres:gpu:a100_3g.39gb
1610   RUNNING  1-09:53:23  standard  gpu-low    1         None       2    8G       gres:gpu:v100
Job 1736 is in the PENDING state since there are no more available a100-sxm4-80gb GPUs. The job priority starts to rise with time (priority 5) as expected. Now another user submits job 1739 on a gres:gpu:a100_1g.10gb:1 that is available, but the job is not starting since its priority is 1. This is obviously not the desired outcome, and I believe I must change the scheduling strategy. Could someone with more experience than me give me some hints?
Thanks, Cristiano
------------------------------
Message: 2
Date: Wed, 2 Aug 2023 07:15:06 -0700
From: Michael Gutteridge <michael.gutteridge at gmail.com>
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Job in "priority" status - resources
available
Message-ID:
<CALUL84uJ7yc7H_eb7c1vaHHdoyTRPB5FHz35u8z24mmzWGCFwA at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
I'm not sure there's enough information in your message: the Slurm version and
configs are often necessary to make a more confident diagnosis. However,
the behaviour you are looking for (lower-priority jobs skipping the line)
is called "backfill". The docs are here:
https://slurm.schedmd.com/sched_config.html#backfill
It should be loaded and active by default which is why I'm not super
confident here. There may also be something else going on with the node
configuration as it looks like 1596 would maybe need the same node? Maybe
there's not enough CPU or memory to accommodate both jobs (1596 and 1739)?
HTH
- Michael
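
To make the "skipping the line" idea concrete, here is a toy sketch contrasting strict priority order with a backfill-style pass. It is not Slurm's algorithm: the real backfill scheduler also uses job time limits so that a lower-priority job only starts if it does not delay the expected start of higher-priority jobs, and it accounts for CPU and memory, all of which this model ignores. The function `schedule`, the tie-break by job ID, and the free-GPU counts are assumptions made for illustration; the pending jobs and their priorities are taken from the table in the original message.

```python
from collections import Counter


def schedule(pending, free, backfill):
    """Toy scheduler pass.

    pending: list of (job_id, priority, gpu_type)
    free:    dict mapping gpu_type -> number of idle GPUs (assumed values)
    Returns the job ids that would start in this pass.
    """
    free = Counter(free)  # missing GPU types count as 0
    started = []
    # Highest priority first; ties broken by lower (older) job id.
    for job_id, _prio, gpu in sorted(pending, key=lambda j: (-j[1], j[0])):
        if free[gpu] > 0:
            free[gpu] -= 1
            started.append(job_id)
        elif not backfill:
            break  # strict order: a blocked job holds back everything behind it
        # with backfill: skip the blocked job and keep scanning the queue
    return started


pending = [
    (1736, 5, "a100-sxm4-80gb"),
    (1737, 5, "a100-sxm4-80gb"),
    (1738, 5, "a100-sxm4-80gb"),
    (1739, 5, "a100_1g.10gb"),
    (1740, 1, "a100_3g.39gb"),
    (1735, 1, "a100-sxm4-80gb"),
]
free = {"a100_1g.10gb": 1}  # assumption: only one small MIG slice is idle

print(schedule(pending, free, backfill=False))  # strict priority order
print(schedule(pending, free, backfill=True))   # backfill-style pass
```

Under these assumptions the strict pass starts nothing because job 1736 blocks the queue, while the backfill-style pass lets job 1739 take the idle slice.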
End of slurm-users Digest, Vol 70, Issue 3
******************************************