[slurm-users] slurm jobs are pending but resources are available

Michael Di Domenico mdidomenico4 at gmail.com
Mon Apr 16 10:50:57 MDT 2018


On Mon, Apr 16, 2018 at 6:35 AM,  <Marius.Cetateanu at sony.com> wrote:
>
> According to the above I have the backfill scheduler enabled with CPUs and Memory configured as
> resources. I have 56 CPUs and 256GB of RAM in my resource pool. I would expect that the backfill
> scheduler attempts to allocate the resources in order to fill as many of the cores as possible if there
> are multiple processes asking for more resources than available. In my case I have the following queue:
>
> I'm going through the documentation again and again but I cannot figure out what I am doing wrong ...
> Why do I have the above situation? What should I change to my config to make this work?
>
> scontrol show -dd job <jobid> shows me the following:
>
> JobId=2361 JobName=training_carlib
>    UserId=mcetateanu(1000) GroupId=mcetateanu(1001) MCS_label=N/A
>    Priority=4294901726 Nice=0 Account=(null) QOS=(null)
>    JobState=PENDING Reason=Resources Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=3-04:00:00 TimeMin=N/A
>    SubmitTime=2018-03-27T10:30:38 EligibleTime=2018-03-27T10:30:38
>    StartTime=2018-03-28T10:27:36 EndTime=2018-03-31T14:27:36 Deadline=N/A
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    Partition=main_compute AllocNode:Sid=zalmoxis:23690
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=(null) SchedNodeList=cn_burebista
>    NumNodes=1 NumCPUs=20 NumTasks=1 CPUs/Task=20 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=20,node=1
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>    MinCPUsNode=20 MinMemoryNode=0 MinTmpDiskNode=0
>    Features=(null) Gres=(null) Reservation=(null)
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>    Command=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/train_classifier.sh
>    WorkDir=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier
>    StdErr=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/training_job_2383.out
>    StdIn=/dev/null
>    StdOut=/home/mcetateanu/workspace/CarLib/src/_out
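
[A minimal way to confirm the scheduler and consumable-resource setup described above, assuming a backfill/cons_res configuration with CPUs and memory tracked as resources; the expected values in the comments are assumptions based on the description, not taken from the poster's slurm.conf:

    # Print the scheduling and resource-selection parameters currently in effect
    scontrol show config | grep -E 'SchedulerType|SelectType'
    #
    # For the setup described above one would expect something like:
    #   SchedulerType          = sched/backfill
    #   SelectType             = select/cons_res
    #   SelectTypeParameters   = CR_CPU_Memory
]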


Perhaps I missed something in the email, but it sounds like you have
56 cores and two running jobs that consume 52 of them, leaving you
four free.  Then a third job came along and requested 20 cores (based
on the show job output).  Slurm doesn't overcommit resources, so a
20-CPU job will not fit while only four CPUs are free.
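
A quick way to check that picture is to compare per-node CPU usage
against what the queued job is asking for; a sketch (the output will
of course depend on your node names and jobs):

    # Allocated/idle/other/total CPUs per node (the A/I/O/T column)
    sinfo -N -o "%N %C"

    # CPUs and memory requested by each running/pending job, plus the pending reason
    squeue -o "%A %T %C %m %R"

With only four idle CPUs reported on the node, a 20-CPU request stays
pending with Reason=Resources until enough cores free up (or the job's
--cpus-per-task is reduced).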


