[slurm-users] slurm jobs are pending but resources are available
Michael Di Domenico
mdidomenico4 at gmail.com
Mon Apr 16 10:50:57 MDT 2018
On Mon, Apr 16, 2018 at 6:35 AM, <Marius.Cetateanu at sony.com> wrote:
>
> According to the above I have the backfill scheduler enabled with CPUs and Memory configured as
> resources. I have 56 CPUs and 256GB of RAM in my resource pool. I would expect that the backfill
> scheduler attempts to allocate the resources in order to fill as many of the cores as possible if there
> are multiple processes asking for more resources than available. In my case I have the following queue:
>
> I'm going through the documentation again and again but I cannot figure out what I am doing wrong ...
> Why do I have the above situation? What should I change in my config to make this work?
>
> scontrol show -dd job <jobid> shows me the following:
>
> JobId=2361 JobName=training_carlib
> UserId=mcetateanu(1000) GroupId=mcetateanu(1001) MCS_label=N/A
> Priority=4294901726 Nice=0 Account=(null) QOS=(null)
> JobState=PENDING Reason=Resources Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> RunTime=00:00:00 TimeLimit=3-04:00:00 TimeMin=N/A
> SubmitTime=2018-03-27T10:30:38 EligibleTime=2018-03-27T10:30:38
> StartTime=2018-03-28T10:27:36 EndTime=2018-03-31T14:27:36 Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> Partition=main_compute AllocNode:Sid=zalmoxis:23690
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=(null) SchedNodeList=cn_burebista
> NumNodes=1 NumCPUs=20 NumTasks=1 CPUs/Task=20 ReqB:S:C:T=0:0:*:*
> TRES=cpu=20,node=1
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> MinCPUsNode=20 MinMemoryNode=0 MinTmpDiskNode=0
> Features=(null) Gres=(null) Reservation=(null)
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/train_classifier.sh
> WorkDir=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier
> StdErr=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/training_job_2383.out
> StdIn=/dev/null
> StdOut=/home/mcetateanu/workspace/CarLib/src/_out
Perhaps I missed something in the email, but it sounds like you have
56 cores and two running jobs that consume 52 of them, leaving you
four free. Then a third job came along and requested 20 cores (based
on the show job output). Slurm doesn't overcommit resources, so a
20-CPU job will not fit while only four CPUs are free.
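
If you want to double-check that accounting, something along these lines
should show the per-node CPU state and what each running job actually holds
(the format strings are just one way to slice it; adjust to taste):

  sinfo -N -o "%N %C"          # CPUs per node as allocated/idle/other/total
  squeue -t R -o "%i %u %C"    # CPUs allocated to each running job

If the idle count comes back as 4, the 20-CPU job will stay in
PENDING/Resources until enough cores free up, or until it asks for fewer
CPUs (e.g. a smaller --cpus-per-task).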