[slurm-users] Backfill isn’t working for a node with two GPUs that have different GRES types.

Randall Radmer radmer at gmail.com
Tue Apr 2 12:24:56 UTC 2019


Hi Marcus,

The following jobs are running or pending after I killed job 100816, which was
running on computelab-134's T4:
100815 RUNNING computelab-134 gpu:gv100:1 None1
100817 PENDING gpu:gv100:1 Resources1
100818 PENDING gpu:tu104:1 Resources1
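
For reference, this listing comes from roughly the same squeue invocation as in the
earlier message quoted below, i.e. something along the lines of:

$ squeue --noheader -u rradmer --Format=jobid,state,gres,nodelist,reason | sed 's/  */ /g' | sort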

$ scontrol -d show node computelab-134
NodeName=computelab-134 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=6 CPUErr=0 CPUTot=12 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:gv100:1,gpu:tu104:1
   GresDrain=N/A
   GresUsed=gpu:gv100:1(IDX:0),gpu:tu104:0(IDX:N/A)
   NodeAddr=computelab-134 NodeHostName=computelab-134 Version=17.11
   OS=Linux 4.4.0-143-generic #169-Ubuntu SMP Thu Feb 7 07:56:38 UTC 2019
   RealMemory=64307 AllocMem=32148 FreeMem=61126 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=404938 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=test-backfill
   BootTime=2019-03-29T12:09:25 SlurmdStartTime=2019-04-01T11:34:35
   CfgTRES=cpu=12,mem=64307M,billing=12,gres/gpu=2,gres/gpu:gv100=1,gres/gpu:tu104=1
   AllocTRES=cpu=6,mem=32148M,gres/gpu=1,gres/gpu:gv100=1
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

$ scontrol -d show job 100815
JobId=100815 JobName=bash
   UserId=rradmer(27578) GroupId=hardware(30) MCS_label=N/A
   Priority=1 Nice=0 Account=cag QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:06:45 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2019-04-02T05:13:05 EligibleTime=2019-04-02T05:13:05
   StartTime=2019-04-02T05:13:05 EndTime=2019-04-02T07:13:05 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-04-02T05:13:05
   Partition=test-backfill AllocNode:Sid=computelab-frontend-02:7873
   ReqNodeList=computelab-134 ExcNodeList=(null)
   NodeList=computelab-134
   BatchHost=computelab-134
   NumNodes=1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
   TRES=cpu=6,mem=32148M,node=1,billing=6,gres/gpu=1,gres/gpu:gv100=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=computelab-134 CPU_IDs=0-5 Mem=32148 GRES_IDX=gpu:gv100(IDX:0)
   MinCPUsNode=6 MinMemoryNode=32148M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=gpu:gv100:1 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/home/rradmer
   Power=

$ scontrol -d show job 100817
JobId=100817 JobName=bash
   UserId=rradmer(27578) GroupId=hardware(30) MCS_label=N/A
   Priority=1 Nice=0 Account=cag QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2019-04-02T05:13:11 EligibleTime=2019-04-02T05:13:11
   StartTime=2019-04-02T07:13:05 EndTime=2019-04-02T09:13:05 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-04-02T05:20:44
   Partition=test-backfill AllocNode:Sid=computelab-frontend-03:21736
   ReqNodeList=computelab-134 ExcNodeList=(null)
   NodeList=(null) SchedNodeList=computelab-134
   NumNodes=1-1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
   TRES=cpu=6,mem=32148M,node=1,gres/gpu=1,gres/gpu:gv100=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=6 MinMemoryNode=32148M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=gpu:gv100:1 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/home/rradmer
   Power=

$ scontrol -d show job 100818
JobId=100818 JobName=bash
   UserId=rradmer(27578) GroupId=hardware(30) MCS_label=N/A
   Priority=1 Nice=0 Account=cag QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2019-04-02T05:13:12 EligibleTime=2019-04-02T05:13:12
   StartTime=2019-04-02T09:13:00 EndTime=2019-04-02T11:13:00 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-04-02T05:21:32
   Partition=test-backfill AllocNode:Sid=computelab-frontend-02:12826
   ReqNodeList=computelab-134 ExcNodeList=(null)
   NodeList=(null) SchedNodeList=computelab-134
   NumNodes=1-1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
   TRES=cpu=6,mem=32148M,node=1,gres/gpu=1,gres/gpu:tu104=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=6 MinMemoryNode=32148M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=gpu:tu104:1 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/home/rradmer
   Power=


On Mon, Apr 1, 2019 at 11:24 PM Marcus Wagner <wagner at itc.rwth-aachen.de>
wrote:

> Dear Randall,
>
> could you please also provide
>
>
> scontrol -d show node computelab-134
> scontrol -d show job 100091
> scontrol -d show job 100094
>
>
> Best
> Marcus
>
> On 4/1/19 4:31 PM, Randall Radmer wrote:
>
> I can’t get backfill to work for a machine with two GPUs (one is a P4 and
> the other a T4).
>
> Submitting jobs works as expected: if the GPU I request is free, then my
> job runs, otherwise it goes into a pending state.  But if I have pending
> jobs for one GPU ahead of pending jobs for the other GPU, I see blocking
> issues.
>
> More specifically, I can create a case where I am running a job on each of
> the GPUs and have a pending job waiting for the P4 followed by a pending
> job waiting for a T4.  I would expect that if I exit the running T4 job,
> then backfill would start the pending T4 job, even though it has to jump
> ahead of the pending P4 job. This does not happen...
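>
> For reference, each of these test jobs is an interactive bash session that asks for one
> specific GPU type by GRES name. The submissions look roughly like the following (a
> reconstruction, not the literal commands, so exact options may have differed):
>
> $ srun -p test-backfill -w computelab-134 -c 6 --mem=32148 --gres=gpu:gv100:1 --pty /bin/bash
> $ srun -p test-backfill -w computelab-134 -c 6 --mem=32148 --gres=gpu:tu104:1 --pty /bin/bash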
>
> The following shows my jobs after I exited from a running T4 job, which
> had ID 100092:
>
> $ squeue --noheader -u rradmer --Format=jobid,state,gres,nodelist,reason |
> sed 's/  */ /g' | sort
>
> 100091 RUNNING gpu:gv100:1 computelab-134 None
>
> 100093 PENDING gpu:gv100:1 Resources
>
> 100094 PENDING gpu:tu104:1 Resources
>
> I can find no reason why 100094 doesn’t start running (I’ve waited up to
> an hour, just to make sure).
>
> System config info and log snippets shown below.
>
> Thanks much,
>
> Randy
>
> Node state corresponding to the squeue command shown above:
>
> $ scontrol show node computelab-134 | grep -i [gt]res
>
>   Gres=gpu:gv100:1,gpu:tu104:1
>
>
>   CfgTRES=cpu=12,mem=64307M,billing=12,gres/gpu=2,gres/gpu:gv100=1,gres/gpu:tu104=1
>
>   AllocTRES=cpu=6,mem=32148M,gres/gpu=1,gres/gpu:gv100=1
>
>
> Slurm config follows:
>
> $ scontrol show conf | grep -Ei '(gres|^Sched|prio|vers)'
>
> AccountingStorageTRES =
> cpu,mem,energy,node,billing,gres/gpu,gres/gpu:gp100,gres/gpu:gp104,gres/gpu:gv100,gres/gpu:tu102,gres/gpu:tu104,gres/gpu:tu106
>
> GresTypes               = gpu
>
> PriorityParameters      = (null)
>
> PriorityDecayHalfLife   = 7-00:00:00
>
> PriorityCalcPeriod      = 00:05:00
>
> PriorityFavorSmall      = No
>
> PriorityFlags           =
>
> PriorityMaxAge          = 7-00:00:00
>
> PriorityUsageResetPeriod = NONE
>
> PriorityType            = priority/multifactor
>
> PriorityWeightAge       = 0
>
> PriorityWeightFairShare = 0
>
> PriorityWeightJobSize   = 0
>
> PriorityWeightPartition = 0
>
> PriorityWeightQOS       = 0
>
> PriorityWeightTRES      = (null)
>
> PropagatePrioProcess    = 0
>
> SchedulerParameters     =
> default_queue_depth=2000,bf_continue,bf_ignore_newly_avail_nodes,bf_max_job_test=1000,bf_window=10080,kill_invalid_depend
>
> SchedulerTimeSlice      = 30 sec
>
> SchedulerType           = sched/backfill
>
> SLURM_VERSION           = 17.11.9-2
>
> GPUs on node:
>
> $ nvidia-smi --query-gpu=index,name,gpu_bus_id --format=csv
>
> index, name, pci.bus_id
>
> 0, Tesla T4, 00000000:82:00.0
>
> 1, Tesla P4, 00000000:83:00.0
>
> The gres file on node:
>
> $ cat /etc/slurm/gres.conf
>
> Name=gpu Type=tu104 File=/dev/nvidia0 Cores=0,1,2,3,4,5
>
> Name=gpu Type=gp104 File=/dev/nvidia1 Cores=6,7,8,9,10,11
>
> Recent entries from the SlurmSchedLogFile:
>
> $ sudo tail -3 slurm.sched.log
>
> [2019-04-01T08:14:23.727] sched: Running job scheduler
>
> [2019-04-01T08:14:23.728] sched: JobId=100093. State=PENDING.
> Reason=Resources. Priority=1. Partition=test-backfill.
>
> [2019-04-01T08:14:23.728] sched: JobId=100094. State=PENDING.
> Reason=Resources. Priority=1. Partition=test-backfill.
>
> Recent entries from the SlurmctldLogFile:
>
> $ sudo grep backfill slurmctld.log  | tail -5
>
> [2019-04-01T08:16:53.281] backfill: beginning
>
> [2019-04-01T08:16:53.281] backfill test for JobID=100093 Prio=1
> Partition=test-backfill
>
> [2019-04-01T08:16:53.281] backfill test for JobID=100094 Prio=1
> Partition=test-backfill
>
> [2019-04-01T08:16:53.281] backfill: reached end of job queue
>
> [2019-04-01T08:16:53.281] backfill: completed testing 2(2) jobs, usec=707
>
>
> --
> Marcus Wagner, Dipl.-Inf.
>
> IT Center
> Abteilung: Systeme und Betrieb
> RWTH Aachen University
> Seffenter Weg 23
> 52074 Aachen
> Tel: +49 241 80-24383
> Fax: +49 241 80-624383
> wagner at itc.rwth-aachen.de
> www.itc.rwth-aachen.de
>
>