[slurm-users] not allocating jobs even though resources are free

Brian W. Johanson bjohanso at psc.edu
Fri Apr 24 18:19:03 UTC 2020


Without seeing the jobs in your queue, I would expect the next job in 
FIFO order to be too large to fit on the currently idle resources.
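
If you want to confirm that, a quick check is the Reason column Slurm 
reports for the pending jobs; the format string below is just one 
example, not something from your setup:

       # list pending jobs with their reason, requested CPUs and time limit
       squeue -t PD -o "%.10i %.9P %.8u %.12r %.5D %.6C %.11l"

       # or inspect a single stuck job in detail
       scontrol show job <jobid>

A reason of "Priority" on jobs that look small enough to fit is the 
typical symptom of pure FIFO scheduling.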

Configure it to use the backfill scheduler: SchedulerType=sched/backfill

       SchedulerType
               Identifies the type of scheduler to be used.  Note the
               slurmctld daemon must be restarted for a change in
               scheduler type to become effective (reconfiguring a
               running daemon has no effect for this parameter).  The
               scontrol command can be used to manually change job
               priorities if desired.  Acceptable values include:

               sched/backfill
                      For a backfill scheduling module to augment the
                      default FIFO scheduling.  Backfill scheduling will
                      initiate lower-priority jobs if doing so does not
                      delay the expected initiation time of any higher
                      priority job.  Effectiveness of backfill scheduling
                      is dependent upon users specifying job time limits,
                      otherwise all jobs will have the same time limit
                      and backfilling is impossible.  Note documentation
                      for the SchedulerParameters option above.  This is
                      the default configuration.

               sched/builtin
                      This is the FIFO scheduler which initiates jobs in
                      priority order.  If any job in the partition can
                      not be scheduled, no lower priority job in that
                      partition will be scheduled.  An exception is made
                      for jobs that can not run due to partition
                      constraints (e.g. the time limit) or down/drained
                      nodes.  In that case, lower priority jobs can be
                      initiated and not impact the higher priority job.
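
A minimal sketch of what that change could look like in slurm.conf; the 
SchedulerParameters line is optional and its values are only 
illustrative assumptions, not taken from your configuration:

       # slurm.conf (commonly /etc/slurm/slurm.conf; the path varies by site)
       SchedulerType=sched/backfill
       # optional backfill tuning -- example values only
       SchedulerParameters=bf_window=1440,bf_continue,default_queue_depth=500

       # slurmctld must be restarted for a SchedulerType change to take effect
       systemctl restart slurmctld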



Your partitions are set with MaxTime=INFINITE; if your users are not 
specifying a reasonable time limit on their jobs, backfill won't help 
either.
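
One way to cover that (the partition name, node list and times below 
are only examples, not from your slurm.conf) is to give the partition a 
default and a maximum time limit and have users request a realistic 
wall time:

       # slurm.conf -- example values, adjust to your site
       PartitionName=GPUsmall Nodes=node[18-19] DefaultTime=01:00:00 MaxTime=7-00:00:00 State=UP

       # in a user's batch script
       #SBATCH --time=02:00:00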


-b


On 4/24/20 1:52 PM, navin srivastava wrote:
> In addition to the above, when I look at sprio for jobs in both partitions it shows:
>
> For the normal partition, all jobs show the same priority:
>
>  JOBID PARTITION   PRIORITY  FAIRSHARE
>         1291352 normal           15789      15789
>
> For GPUsmall, all jobs show the same priority:
>
>  JOBID PARTITION   PRIORITY  FAIRSHARE
>         1291339 GPUsmall      21052      21053
>
> On Fri, Apr 24, 2020 at 11:14 PM navin srivastava 
> <navin.altair at gmail.com <mailto:navin.altair at gmail.com>> wrote:
>
>     Hi Team,
>
>     We are facing an issue in our environment: resources are free,
>     but jobs are going into the pending (PD) state and not running.
>
>     I have attached the slurm.conf file here.
>
>     Scenario:
>
>     There are pending jobs in only two partitions:
>     344 jobs are in PD state in the normal partition, and the nodes
>     belonging to the normal partition are full, so no more jobs can run there.
>
>     1300 jobs in the GPUsmall partition are queued, and enough CPUs are
>     available to execute them, but I see the jobs are not being
>     scheduled on the free nodes.
>
>     There are no pending jobs in any other partition.
>     e.g. node status for node18:
>
>     NodeName=node18 Arch=x86_64 CoresPerSocket=18
>        CPUAlloc=6 CPUErr=0 CPUTot=36 CPULoad=4.07
>        AvailableFeatures=K2200
>        ActiveFeatures=K2200
>        Gres=gpu:2
>        NodeAddr=node18 NodeHostName=node18 Version=17.11
>        OS=Linux 4.4.140-94.42-default #1 SMP Tue Jul 17 07:44:50 UTC
>     2018 (0b375e4)
>        RealMemory=1 AllocMem=0 FreeMem=79532 Sockets=2 Boards=1
>        State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
>     MCS_label=N/A
>        Partitions=GPUsmall,pm_shared
>        BootTime=2019-12-10T14:16:37 SlurmdStartTime=2019-12-10T14:24:08
>        CfgTRES=cpu=36,mem=1M,billing=36
>        AllocTRES=cpu=6
>        CapWatts=n/a
>        CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>        ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>     node19:
>
>     NodeName=node19 Arch=x86_64 CoresPerSocket=18
>        CPUAlloc=16 CPUErr=0 CPUTot=36 CPULoad=15.43
>        AvailableFeatures=K2200
>        ActiveFeatures=K2200
>        Gres=gpu:2
>        NodeAddr=node19 NodeHostName=node19 Version=17.11
>        OS=Linux 4.12.14-94.41-default #1 SMP Wed Oct 31 12:25:04 UTC
>     2018 (3090901)
>        RealMemory=1 AllocMem=0 FreeMem=63998 Sockets=2 Boards=1
>        State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
>     MCS_label=N/A
>        Partitions=GPUsmall,pm_shared
>        BootTime=2020-03-12T06:51:54 SlurmdStartTime=2020-03-12T06:53:14
>        CfgTRES=cpu=36,mem=1M,billing=36
>        AllocTRES=cpu=16
>        CapWatts=n/a
>        CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>        ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>     Could you please help me understand what the reason might be?
>


