[slurm-users] Backfill isn’t working for a node with two GPUs that have different GRES types.

Marcus Wagner wagner at itc.rwth-aachen.de
Wed Apr 3 08:47:17 UTC 2019


Hmm...

I'm a bit puzzled; as far as I can tell, everything looks OK.

Did you try restarting slurmctld?
I once had a case where users could no longer submit to the default
partition because Slurm told them (if I remember correctly) something like
"wrong account/partition combination".
My first suspicion was my submission script, since I had changed it
recently, but I could not find any error there, and scontrol reconfig did
not help. Everything worked again after I restarted slurmctld.

Might be worth a try.
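
In case it helps, this is roughly what I mean (a sketch, assuming slurmctld
is managed by systemd on your controller node; adjust for your setup):

    # on the slurmctld host
    $ sudo systemctl restart slurmctld
    $ sudo systemctl status slurmctld   # check that it came back up cleanly
    $ scontrol ping                     # confirm the controller responds again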


Best
Marcus

On 4/2/19 1:24 PM, Randall Radmer wrote:
> Hi Marcus,
>
> The following jobs are running or pending after I killed job 100816, which 
> was running on computelab-134's T4:
> 100815 RUNNING computelab-134 gpu:gv100:1 None
> 100817 PENDING gpu:gv100:1 Resources
> 100818 PENDING gpu:tu104:1 Resources
>
> $ scontrol -d show node computelab-134
> NodeName=computelab-134 Arch=x86_64 CoresPerSocket=6
>    CPUAlloc=6 CPUErr=0 CPUTot=12 CPULoad=0.00
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=gpu:gv100:1,gpu:tu104:1
>    GresDrain=N/A
>  GresUsed=gpu:gv100:1(IDX:0),gpu:tu104:0(IDX:N/A)
>    NodeAddr=computelab-134 NodeHostName=computelab-134 Version=17.11
>    OS=Linux 4.4.0-143-generic #169-Ubuntu SMP Thu Feb 7 07:56:38 UTC 2019
>    RealMemory=64307 AllocMem=32148 FreeMem=61126 Sockets=2 Boards=1
>    State=MIXED ThreadsPerCore=1 TmpDisk=404938 Weight=1 Owner=N/A 
> MCS_label=N/A
>    Partitions=test-backfill
>    BootTime=2019-03-29T12:09:25 SlurmdStartTime=2019-04-01T11:34:35
>  CfgTRES=cpu=12,mem=64307M,billing=12,gres/gpu=2,gres/gpu:gv100=1,gres/gpu:tu104=1
>  AllocTRES=cpu=6,mem=32148M,gres/gpu=1,gres/gpu:gv100=1
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> $ scontrol -d show job 100815
> JobId=100815 JobName=bash
>    UserId=rradmer(27578) GroupId=hardware(30) MCS_label=N/A
>    Priority=1 Nice=0 Account=cag QOS=normal
>    JobState=RUNNING Reason=None Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
>    DerivedExitCode=0:0
>    RunTime=00:06:45 TimeLimit=02:00:00 TimeMin=N/A
>    SubmitTime=2019-04-02T05:13:05 EligibleTime=2019-04-02T05:13:05
>    StartTime=2019-04-02T05:13:05 EndTime=2019-04-02T07:13:05 Deadline=N/A
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    LastSchedEval=2019-04-02T05:13:05
>    Partition=test-backfill AllocNode:Sid=computelab-frontend-02:7873
>    ReqNodeList=computelab-134 ExcNodeList=(null)
>    NodeList=computelab-134
>    BatchHost=computelab-134
>    NumNodes=1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
>  TRES=cpu=6,mem=32148M,node=1,billing=6,gres/gpu=1,gres/gpu:gv100=1
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>      Nodes=computelab-134 CPU_IDs=0-5 Mem=32148 GRES_IDX=gpu:gv100(IDX:0)
>    MinCPUsNode=6 MinMemoryNode=32148M MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    Gres=gpu:gv100:1 Reservation=(null)
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>    Command=/bin/bash
>    WorkDir=/home/rradmer
>    Power=
>
> $ scontrol -d show job 100817
> JobId=100817 JobName=bash
>    UserId=rradmer(27578) GroupId=hardware(30) MCS_label=N/A
>    Priority=1 Nice=0 Account=cag QOS=normal
>    JobState=PENDING Reason=Resources Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
>    DerivedExitCode=0:0
>    RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
>    SubmitTime=2019-04-02T05:13:11 EligibleTime=2019-04-02T05:13:11
>    StartTime=2019-04-02T07:13:05 EndTime=2019-04-02T09:13:05 Deadline=N/A
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    LastSchedEval=2019-04-02T05:20:44
>    Partition=test-backfill AllocNode:Sid=computelab-frontend-03:21736
>    ReqNodeList=computelab-134 ExcNodeList=(null)
>    NodeList=(null) SchedNodeList=computelab-134
>    NumNodes=1-1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
>  TRES=cpu=6,mem=32148M,node=1,gres/gpu=1,gres/gpu:gv100=1
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>    MinCPUsNode=6 MinMemoryNode=32148M MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    Gres=gpu:gv100:1 Reservation=(null)
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>    Command=/bin/bash
>    WorkDir=/home/rradmer
>    Power=
>
> $ scontrol -d show job 100818
> JobId=100818 JobName=bash
>    UserId=rradmer(27578) GroupId=hardware(30) MCS_label=N/A
>    Priority=1 Nice=0 Account=cag QOS=normal
>    JobState=PENDING Reason=Resources Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
>    DerivedExitCode=0:0
>    RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
>    SubmitTime=2019-04-02T05:13:12 EligibleTime=2019-04-02T05:13:12
>    StartTime=2019-04-02T09:13:00 EndTime=2019-04-02T11:13:00 Deadline=N/A
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    LastSchedEval=2019-04-02T05:21:32
>    Partition=test-backfill AllocNode:Sid=computelab-frontend-02:12826
>    ReqNodeList=computelab-134 ExcNodeList=(null)
>    NodeList=(null) SchedNodeList=computelab-134
>    NumNodes=1-1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
>  TRES=cpu=6,mem=32148M,node=1,gres/gpu=1,gres/gpu:tu104=1
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>    MinCPUsNode=6 MinMemoryNode=32148M MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    Gres=gpu:tu104:1 Reservation=(null)
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>    Command=/bin/bash
>    WorkDir=/home/rradmer
>    Power=
>
>
> On Mon, Apr 1, 2019 at 11:24 PM Marcus Wagner 
> <wagner at itc.rwth-aachen.de> wrote:
>
>     Dear Randall,
>
>     could you please also provide
>
>
>     scontrol -d show node computelab-134
>     scontrol -d show job 100091
>     scontrol -d show job 100094
>
>
>     Best
>     Marcus
>
>     On 4/1/19 4:31 PM, Randall Radmer wrote:
>>
>>     I can’t get backfill to work for a machine with two GPUs (one is
>>     a P4 and the other a T4).
>>
>>     Submitting jobs works as expected: if the GPU I request is free,
>>     then my job runs, otherwise it goes into a pending state.  But if
>>     I have pending jobs for one GPU ahead of pending jobs for the
>>     other GPU, I see blocking issues.
>>
>>
>>     More specifically, I can create a case where I am running a job
>>     on each of the GPUs and have a pending job waiting for the P4
>>     followed by a pending job waiting for a T4.  I would expect that
>>     if I exit the running T4 job, then backfill would start the
>>     pending T4 job, even though it has to jump ahead of the pending P4
>>     job. This does not happen...
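>>
>>     For context, the jobs in question were interactive sessions each
>>     requesting a specific GPU type, submitted roughly like this (a
>>     sketch based on the job records in this thread; the exact options
>>     I used may have differed):
>>
>>     # job that wants the gv100 GPU
>>     $ srun -w computelab-134 --gres=gpu:gv100:1 --cpus-per-task=6 --pty /bin/bash
>>
>>     # job that wants the tu104 (T4) GPU
>>     $ srun -w computelab-134 --gres=gpu:tu104:1 --cpus-per-task=6 --pty /bin/bash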
>>
>>
>>     The following shows my jobs after I exited from a running T4 job,
>>     which had ID 100092:
>>
>>     $ squeue --noheader -u rradmer
>>     --Format=jobid,state,gres,nodelist,reason | sed 's/  */ /g' | sort
>>
>>     100091 RUNNING gpu:gv100:1 computelab-134 None
>>
>>     100093 PENDING gpu:gv100:1 Resources
>>
>>     100094 PENDING gpu:tu104:1 Resources
>>
>>
>>     I can find no reason why 100094 doesn’t start running (I’ve
>>     waited up to an hour, just to make sure).
>>
>>
>>     System config info and log snippets shown below.
>>
>>
>>     Thanks much,
>>
>>     Randy
>>
>>
>>     Node state corresponding to the squeue command, shown above:
>>
>>     $ scontrol show node computelab-134 | grep -i [gt]res
>>
>>       Gres=gpu:gv100:1,gpu:tu104:1
>>
>>       CfgTRES=cpu=12,mem=64307M,billing=12,gres/gpu=2,gres/gpu:gv100=1,gres/gpu:tu104=1
>>
>>       AllocTRES=cpu=6,mem=32148M,gres/gpu=1,gres/gpu:gv100=1
>>
>>
>>
>>     Slurm config follows:
>>
>>     $ scontrol show conf | grep -Ei '(gres|^Sched|prio|vers)'
>>
>>     AccountingStorageTRES =
>>     cpu,mem,energy,node,billing,gres/gpu,gres/gpu:gp100,gres/gpu:gp104,gres/gpu:gv100,gres/gpu:tu102,gres/gpu:tu104,gres/gpu:tu106
>>
>>     GresTypes               = gpu
>>
>>     PriorityParameters      = (null)
>>
>>     PriorityDecayHalfLife   = 7-00:00:00
>>
>>     PriorityCalcPeriod      = 00:05:00
>>
>>     PriorityFavorSmall      = No
>>
>>     PriorityFlags           =
>>
>>     PriorityMaxAge          = 7-00:00:00
>>
>>     PriorityUsageResetPeriod = NONE
>>
>>     PriorityType            = priority/multifactor
>>
>>     PriorityWeightAge       = 0
>>
>>     PriorityWeightFairShare = 0
>>
>>     PriorityWeightJobSize   = 0
>>
>>     PriorityWeightPartition = 0
>>
>>     PriorityWeightQOS       = 0
>>
>>     PriorityWeightTRES      = (null)
>>
>>     PropagatePrioProcess    = 0
>>
>>     SchedulerParameters     =
>>     default_queue_depth=2000,bf_continue,bf_ignore_newly_avail_nodes,bf_max_job_test=1000,bf_window=10080,kill_invalid_depend
>>
>>     SchedulerTimeSlice      = 30 sec
>>
>>     SchedulerType           = sched/backfill
>>
>>     SLURM_VERSION           = 17.11.9-2
>>
>>
>>     GPUs on node:
>>
>>     $ nvidia-smi --query-gpu=index,name,gpu_bus_id --format=csv
>>
>>     index, name, pci.bus_id
>>
>>     0, Tesla T4, 00000000:82:00.0
>>
>>     1, Tesla P4, 00000000:83:00.0
>>
>>     The gres file on node:
>>
>>     $ cat /etc/slurm/gres.conf
>>
>>     Name=gpu Type=tu104 File=/dev/nvidia0 Cores=0,1,2,3,4,5
>>
>>     Name=gpu Type=gp104 File=/dev/nvidia1 Cores=6,7,8,9,10,11
>>
>>
>>     Random sample of SlurmSchedLogFile:
>>
>>     $ sudo tail -3 slurm.sched.log
>>
>>     [2019-04-01T08:14:23.727] sched: Running job scheduler
>>
>>     [2019-04-01T08:14:23.728] sched: JobId=100093. State=PENDING.
>>     Reason=Resources. Priority=1. Partition=test-backfill.
>>
>>     [2019-04-01T08:14:23.728] sched: JobId=100094. State=PENDING.
>>     Reason=Resources. Priority=1. Partition=test-backfill.
>>
>>
>>     Random sample of SlurmctldLogFile:
>>
>>     $ sudo grep backfill slurmctld.log  | tail -5
>>
>>     [2019-04-01T08:16:53.281] backfill: beginning
>>
>>     [2019-04-01T08:16:53.281] backfill test for JobID=100093 Prio=1
>>     Partition=test-backfill
>>
>>     [2019-04-01T08:16:53.281] backfill test for JobID=100094 Prio=1
>>     Partition=test-backfill
>>
>>     [2019-04-01T08:16:53.281] backfill: reached end of job queue
>>
>>     [2019-04-01T08:16:53.281] backfill: completed testing 2(2) jobs,
>>     usec=707
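>>
>>     In case more detail from the backfill loop would help, backfill
>>     debug logging can be enabled at runtime (a sketch; this can also be
>>     set permanently with DebugFlags=Backfill in slurm.conf):
>>
>>     $ sudo scontrol setdebugflags +backfill
>>     $ sudo grep backfill slurmctld.log | tail    # per-job backfill decisions are now logged in more detail
>>     $ sudo scontrol setdebugflags -backfill      # disable again when done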
>>
>
>     -- 
>     Marcus Wagner, Dipl.-Inf.
>
>     IT Center
>     Abteilung: Systeme und Betrieb
>     RWTH Aachen University
>     Seffenter Weg 23
>     52074 Aachen
>     Tel: +49 241 80-24383
>     Fax: +49 241 80-624383
>     wagner at itc.rwth-aachen.de
>     www.itc.rwth-aachen.de
>

-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de
