[slurm-users] not allocating jobs even resources are free

navin srivastava navin.altair at gmail.com
Fri Apr 24 17:52:43 UTC 2020


In addition to the above, when I check sprio for the jobs in both
partitions, it shows the following.

For the normal partition, all jobs show the same priority:

  JOBID PARTITION   PRIORITY  FAIRSHARE
1291352 normal         15789      15789

For GPUsmall, all jobs also show the same priority:

  JOBID PARTITION   PRIORITY  FAIRSHARE
1291339 GPUsmall       21052      21053
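
The "same priority" observation can be checked mechanically across many jobs. The following is a minimal sketch, assuming the sprio output has been saved as text with the four columns shown above. The function name priorities_by_partition and the sample text are illustrative only and are not part of Slurm.

```python
from collections import defaultdict

def priorities_by_partition(sprio_text):
    """Group the distinct PRIORITY values seen in each partition."""
    seen = defaultdict(set)
    for line in sprio_text.splitlines():
        fields = line.split()
        # Skip blank lines and header rows; data rows start with a numeric job ID.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        partition, priority = fields[1], int(fields[2])
        seen[partition].add(priority)
    return dict(seen)

# Sample rows copied from the sprio output quoted in this message.
sample = """\
  JOBID PARTITION   PRIORITY  FAIRSHARE
1291352 normal         15789      15789
1291339 GPUsmall       21052      21053
"""

result = priorities_by_partition(sample)
for partition, values in result.items():
    print(partition, sorted(values))
```

A partition that maps to a single value means every listed job in that partition has the same priority.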

On Fri, Apr 24, 2020 at 11:14 PM navin srivastava <navin.altair at gmail.com>
wrote:

> Hi Team,
>
> We are facing an issue in our environment: resources are free, but jobs
> stay pending in the queue instead of running.
>
> I have attached the slurm.conf file.
>
> Scenario:
>
> There are pending jobs in only 2 partitions:
>  344 jobs are in PD state in the normal partition. The nodes that belong
> to the normal partition are full, so no more jobs can run there.
>
> 1300 jobs are queued in the GPUsmall partition. Enough CPUs are
> available to run them, but the jobs are not being scheduled on the free
> nodes.
>
> There are no pending jobs in any other partition.
> For example:
> Node status for node18:
>
> NodeName=node18 Arch=x86_64 CoresPerSocket=18
>    CPUAlloc=6 CPUErr=0 CPUTot=36 CPULoad=4.07
>    AvailableFeatures=K2200
>    ActiveFeatures=K2200
>    Gres=gpu:2
>    NodeAddr=node18 NodeHostName=node18 Version=17.11
>    OS=Linux 4.4.140-94.42-default #1 SMP Tue Jul 17 07:44:50 UTC 2018 (0b375e4)
>    RealMemory=1 AllocMem=0 FreeMem=79532 Sockets=2 Boards=1
>    State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=GPUsmall,pm_shared
>    BootTime=2019-12-10T14:16:37 SlurmdStartTime=2019-12-10T14:24:08
>    CfgTRES=cpu=36,mem=1M,billing=36
>    AllocTRES=cpu=6
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> Node status for node19:
>
> NodeName=node19 Arch=x86_64 CoresPerSocket=18
>    CPUAlloc=16 CPUErr=0 CPUTot=36 CPULoad=15.43
>    AvailableFeatures=K2200
>    ActiveFeatures=K2200
>    Gres=gpu:2
>    NodeAddr=node19 NodeHostName=node19 Version=17.11
>    OS=Linux 4.12.14-94.41-default #1 SMP Wed Oct 31 12:25:04 UTC 2018 (3090901)
>    RealMemory=1 AllocMem=0 FreeMem=63998 Sockets=2 Boards=1
>    State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=GPUsmall,pm_shared
>    BootTime=2020-03-12T06:51:54 SlurmdStartTime=2020-03-12T06:53:14
>    CfgTRES=cpu=36,mem=1M,billing=36
>    AllocTRES=cpu=16
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> Could you please help me understand what the reason could be?
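
To compare nodes quickly, the node records quoted above can be parsed into key/value pairs. This is a minimal sketch, assuming the text is `scontrol show node` output like the records in this message. The helper names parse_node_record and free_cpus are illustrative only and are not a Slurm API. Free CPUs are only one of the values in the record, so the sketch also prints RealMemory and Gres as reported.

```python
import re

def parse_node_record(text):
    """Parse the Key=Value pairs of one node record into a dict of strings."""
    # Each field has the form Key=Value with no spaces inside the value.
    # Nested values such as CfgTRES=cpu=36,mem=1M,billing=36 stay in one string.
    return dict(re.findall(r"(\w+)=(\S+)", text))

def free_cpus(node):
    """Return CPUTot minus CPUAlloc for a parsed node record."""
    return int(node["CPUTot"]) - int(node["CPUAlloc"])

# Excerpt of the node18 record quoted in this message.
sample = """\
NodeName=node18 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=6 CPUErr=0 CPUTot=36 CPULoad=4.07
   Gres=gpu:2
   RealMemory=1 AllocMem=0 FreeMem=79532 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   CfgTRES=cpu=36,mem=1M,billing=36
"""

node = parse_node_record(sample)
print(node["NodeName"], "free CPUs:", free_cpus(node),
      "RealMemory:", node["RealMemory"], "Gres:", node["Gres"])
```

The same two functions can be applied to the node19 record to compare its CPUTot and CPUAlloc values.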