[slurm-users] Job in "priority" status - resources available

Cumer Cristiano CristianoMaria.Cumer at unibz.it
Wed Aug 2 12:09:52 UTC 2023


Hello,

I'm quite new to Slurm. I recently set up a small Slurm instance to manage our GPU resources, and I have this situation:

 JOBID        STATE         TIME   ACCOUNT    PARTITION    PRIORITY              REASON CPU MIN_MEM              TRES_PER_NODE
    1739    PENDING         0:00  standard      gpu-low           5            Priority   1     80G    gres:gpu:a100_1g.10gb:1
    1738    PENDING         0:00  standard      gpu-low           5            Priority   1     80G  gres:gpu:a100-sxm4-80gb:1
    1737    PENDING         0:00  standard      gpu-low           5            Priority   1     80G  gres:gpu:a100-sxm4-80gb:1
    1736    PENDING         0:00  standard      gpu-low           5           Resources   1     80G  gres:gpu:a100-sxm4-80gb:1
    1740    PENDING         0:00  standard      gpu-low           1            Priority   1      8G      gres:gpu:a100_3g.39gb
    1735    PENDING         0:00  standard      gpu-low           1            Priority   8     64G  gres:gpu:a100-sxm4-80gb:1
    1596    RUNNING   1-13:26:45  standard      gpu-low           3                None   2     64G    gres:gpu:a100_1g.10gb:1
    1653    RUNNING     21:09:52  standard      gpu-low           2                None   1     16G                 gres:gpu:1
    1734    RUNNING        59:52  standard      gpu-low           1                None   8     64G  gres:gpu:a100-sxm4-80gb:1
    1733    RUNNING      1:01:54  standard      gpu-low           1                None   8     64G  gres:gpu:a100-sxm4-80gb:1
    1732    RUNNING      1:02:39  standard      gpu-low           1                None   8     40G  gres:gpu:a100-sxm4-80gb:1
    1731    RUNNING      1:08:28  standard      gpu-low           1                None   8     40G  gres:gpu:a100-sxm4-80gb:1
    1718    RUNNING     10:16:40  standard      gpu-low           1                None   2      8G              gres:gpu:v100
    1630    RUNNING   1-00:21:21  standard      gpu-low           1                None   1     30G      gres:gpu:a100_3g.39gb
    1610    RUNNING   1-09:53:23  standard      gpu-low           1                None   2      8G              gres:gpu:v100
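(For reference, output in roughly this shape can be produced with a custom
squeue format string; this is a sketch, not necessarily the exact command:

    squeue -o "%.8i %.10T %.12M %.9a %.12P %.11Q %.19r %.3C %.7m %.26b"

where %Q is the job priority, %r the pending reason, and %b the requested GRES.)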



Job 1736 is PENDING because there are no available a100-sxm4-80gb GPUs, and its priority rises with queue wait time (currently 5), as expected. Now another user submits job 1739, requesting a gres:gpu:a100_1g.10gb GPU that is currently free, yet that job does not start either: its reason is Priority, which as I understand it means Slurm is holding it behind the higher-priority pending job 1736. This is obviously not the desired outcome, and I believe I need to change the scheduling strategy. Could someone with more experience than me give me some hints?
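My (possibly naive) understanding is that this behaviour hinges on the
scheduler and priority settings in slurm.conf, along these lines. This is a
sketch of common knobs, not my actual configuration, and the values are
placeholders:

    # Backfill lets lower-priority jobs start early as long as they do not
    # delay the expected start of higher-priority pending jobs (e.g. 1736).
    SchedulerType=sched/backfill
    SchedulerParameters=bf_continue,bf_window=1440,bf_resolution=60
    # Multifactor priority is what makes priority grow with queue age.
    PriorityType=priority/multifactor
    PriorityWeightAge=1000

Is something like backfill what I should be looking at here?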

Thanks, Cristiano