[slurm-users] disable-bindings disables counting of gres resources

Quirin Lohr quirin.lohr at in.tum.de
Fri Apr 5 13:31:56 UTC 2019


Same problem here: a job submitted with --gres-flags=disable-binding is 
assigned a node, but the job step then fails because all GPUs on that 
node are already in use. Log messages:

[2019-04-05T15:29:05.216] error: gres/gpu: job 92453 node node5 overallocated resources by 1, (9 > 8)
[2019-04-05T15:29:05.216] Gres topology sub-optimal for job 92453
[2019-04-05T15:29:05.217] sched: _slurm_rpc_allocate_resources JobId=92453 NodeList=node5 usec=497
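
For reference, this is roughly how we confirm which jobs actually hold the
GPUs on the affected node; the squeue format string is only an illustration
and the flags may differ slightly between Slurm versions:

# squeue -w node5 -o "%.8i %.8u %.2t %.10M %b"
# scontrol -d show job 92453 | grep -i gres

The first command lists the jobs on the node together with their requested
GRES, the second prints the per-job GPU index assignment (the GRES_IDX= lines).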

On 25.03.19 at 10:30, Peter Steinbach wrote:
> Dear all,
> 
> Using these config files,
> 
> https://github.com/psteinb/docker-centos7-slurm/blob/7bdb89161febacfd2dbbcb3c5684336fb73d7608/gres.conf
> https://github.com/psteinb/docker-centos7-slurm/blob/7bdb89161febacfd2dbbcb3c5684336fb73d7608/slurm.conf
> 
> I observed a weird behavior of the '--gres-flags=disable-binding' 
> option. With the above .conf files, I created a local Slurm cluster with 
> three compute nodes (2 GPUs and 4 cores each).
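> 
> In spirit, the GPU-related parts of those files boil down to something like
> the sketch below; the authoritative lines are in the links above, and the
> select plugin and the core assignments shown here are only assumptions:
> 
> # slurm.conf (sketch)
> GresTypes=gpu
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> NodeName=g[1-3] Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=4000 Gres=gpu:titanxp:2
> PartitionName=gpu Nodes=g[1-3] Default=YES State=UP
> 
> # gres.conf (sketch)
> NodeName=g[1-3] Name=gpu Type=titanxp File=/dev/nvidia0 Cores=0-1
> NodeName=g[1-3] Name=gpu Type=titanxp File=/dev/nvidia1 Cores=2-3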
> 
> # sinfo -N -l
> Mon Mar 25 09:20:59 2019
> NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
> g1             1      gpu*        idle    4    1:4:1   4000        0      1   (null) none
> g2             1      gpu*        idle    4    1:4:1   4000        0      1   (null) none
> g3             1      gpu*        idle    4    1:4:1   4000        0      1   (null) none
> 
> I first submitted 3 jobs that consume all available GPUs:
> 
> # sbatch --gres=gpu:2 --wrap="env && sleep 600" -o block_2gpus_%A.out --mem=500
> Submitted batch job 2
> # sbatch --gres=gpu:2 --wrap="env && sleep 600" -o block_2gpus_%A.out --mem=500
> Submitted batch job 3
> # sbatch --gres=gpu:2 --wrap="env && sleep 600" -o block_2gpus_%A.out --mem=500
> Submitted batch job 4
> # squeue
>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>                  5       gpu     wrap     root  R       0:04      1 g1
>                  6       gpu     wrap     root  R       0:01      1 g2
>                  7       gpu     wrap     root  R       0:01      1 g3
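> 
> (For comparison, a further GPU job submitted without any special flags pends
> as expected; this is only a sketch, the job id and output file name are
> hypothetical:)
> 
> # sbatch --gres=gpu:1 --wrap="env && sleep 30" -o pend_1gpu_%A.out --mem=500
> Submitted batch job 8
> # squeue -j 8
>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>                  8       gpu     wrap     root PD       0:00      1 (Resources)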
> 
> Funnily enough, if I submit a job that requests only one GPU and add 
> --gres-flags=disable-binding, it actually starts running.
> 
> # sbatch --gres=gpu:1 --wrap="env && sleep 30" -o use_1gpu_%A.out --mem=500 --gres-flags=disable-binding
> Submitted batch job 9
> [root@ernie /]# squeue
>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>                  5       gpu     wrap     root  R       1:44      1 g1
>                  6       gpu     wrap     root  R       1:41      1 g2
>                  7       gpu     wrap     root  R       1:41      1 g3
>                  9       gpu     wrap     root  R       0:02      1 g1
> 
> I am not sure what to make of this. I consider this behavior far from 
> ideal: our users report that their jobs die due to insufficient GPU memory 
> being available, which is no surprise, as the already running GPU jobs are 
> using the GPUs (as they should).
> 
> I am a bit lost here. Slurm is at least clever enough NOT to set 
> CUDA_VISIBLE_DEVICES for the job that has 
> '--gres-flags=disable-binding', but that doesn't help our users.
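> 
> (Since the blocking jobs wrap 'env', this is easy to verify from their
> output files; a sketch, the exact device indices are illustrative:)
> 
> # grep -H CUDA_VISIBLE_DEVICES /block_2gpus_5.out /use_1gpu_9.out
> /block_2gpus_5.out:CUDA_VISIBLE_DEVICES=0,1
> 
> No matching line is printed for the disable-binding job's output, i.e. the
> variable is simply not set there.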
> 
> Personally, I believe this is a bug, but I would love to get feedback 
> from other slurm users/developers.
> 
> Thanks in advance -
> P
> 
> # scontrol show Nodes g1
> NodeName=g1 CoresPerSocket=4
>     CPUAlloc=1 CPUTot=4 CPULoad=N/A
>     AvailableFeatures=(null)
>     ActiveFeatures=(null)
>     Gres=gpu:titanxp:2
>     NodeAddr=127.0.0.1 NodeHostName=localhost Port=0
>     RealMemory=4000 AllocMem=500 FreeMem=N/A Sockets=1 Boards=1
>     State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>     Partitions=gpu
>     BootTime=2019-03-18T10:14:18 SlurmdStartTime=2019-03-25T09:20:57
>     CfgTRES=cpu=4,mem=4000M,billing=4
>     AllocTRES=cpu=1,mem=500M
>     CapWatts=n/a
>     CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> 
> JobId=5 JobName=wrap
>     UserId=root(0) GroupId=root(0) MCS_label=N/A
>     Priority=4294901756 Nice=0 Account=(null) QOS=normal
>     JobState=RUNNING Reason=None Dependency=(null)
>     Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>     DerivedExitCode=0:0
>     RunTime=00:06:30 TimeLimit=5-00:00:00 TimeMin=N/A
>     SubmitTime=2019-03-25T09:23:13 EligibleTime=2019-03-25T09:23:13
>     AccrueTime=Unknown
>     StartTime=2019-03-25T09:23:13 EndTime=2019-03-30T09:23:13 Deadline=N/A
>     PreemptTime=None SuspendTime=None SecsPreSuspend=0
>     LastSchedEval=2019-03-25T09:23:13
>     Partition=gpu AllocNode:Sid=ernie:1
>     ReqNodeList=(null) ExcNodeList=(null)
>     NodeList=g1
>     BatchHost=localhost
>     NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>     TRES=cpu=1,mem=500M,node=1,billing=1
>     Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>       Nodes=g1 CPU_IDs=0 Mem=500 GRES_IDX=gpu(IDX:0-1)
>     MinCPUsNode=1 MinMemoryNode=500M MinTmpDiskNode=0
>     Features=(null) DelayBoot=00:00:00
>     OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>     Command=(null)
>     WorkDir=/
>     StdErr=//block_2gpus_5.out
>     StdIn=/dev/null
>     StdOut=//block_2gpus_5.out
>     Power=
>     TresPerNode=gpu:2
> 
> JobId=10 JobName=wrap
>     UserId=root(0) GroupId=root(0) MCS_label=N/A
>     Priority=4294901751 Nice=0 Account=(null) QOS=normal
>     JobState=RUNNING Reason=None Dependency=(null)
>     Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>     DerivedExitCode=0:0
>     RunTime=00:00:07 TimeLimit=5-00:00:00 TimeMin=N/A
>     SubmitTime=2019-03-25T09:29:12 EligibleTime=2019-03-25T09:29:12
>     AccrueTime=Unknown
>     StartTime=2019-03-25T09:29:12 EndTime=2019-03-30T09:29:12 Deadline=N/A
>     PreemptTime=None SuspendTime=None SecsPreSuspend=0
>     LastSchedEval=2019-03-25T09:29:12
>     Partition=gpu AllocNode:Sid=ernie:1
>     ReqNodeList=(null) ExcNodeList=(null)
>     NodeList=g1
>     BatchHost=localhost
>     NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>     TRES=cpu=1,mem=500M,node=1,billing=1
>     Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>       Nodes=g1 CPU_IDs=1 Mem=500 GRES_IDX=gpu(IDX:)
>     MinCPUsNode=1 MinMemoryNode=500M MinTmpDiskNode=0
>     Features=(null) DelayBoot=00:00:00
>     OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>     Command=(null)
>     WorkDir=/
>     StdErr=//use_1gpu_10.out
>     StdIn=/dev/null
>     StdOut=//use_1gpu_10.out
>     Power=
>     GresEnforceBind=No
>     TresPerNode=gpu:1
> 

-- 
Quirin Lohr
System Administration
Technische Universität München
Fakultät für Informatik
Lehrstuhl für Bildverarbeitung und Mustererkennung

Boltzmannstrasse 3
85748 Garching

Tel. +49 89 289 17769
Fax +49 89 289 17757

quirin.lohr at in.tum.de
www.vision.in.tum.de
