<div dir="ltr">Hello,<br><br>we are having trouble with gang scheduling once GPUs are involved. We are using the following slurm.conf settings:<br><br>ProctrackType=proctrack/cgroup<br>TaskPlugin=task/cgroup<br><br>SchedulerType=sched/backfill<br>SchedulerTimeSlice=60<br>SelectType=select/cons_tres<br>SelectTypeParameters=CR_CPU_Memory<br><br>PreemptType=preempt/qos<br>PreemptMode=SUSPEND, GANG<br>PreemptExemptTime=-1<br><br>NodeName=cn2 Sockets=4 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=56000 Gres=gpu:geforce_gtx_1080_ti:2<br>[...]<br>PartitionName=main Nodes=cn2,cn3,cn4 Default=YES MaxTime=INFINITE State=UP OverSubscribe=FORCE:4<br>[...]<br>----------------------------------------------------------------------------<br><br>We use QoS-based preemption so that lower-priority jobs are preempted automatically when higher-priority jobs arrive in the queue, and this works nicely. However, when we submit several GPU jobs with sbatch using the script shown below, they are not gang-scheduled, and no error message appears in the logs. For jobs involving only CPUs, gang scheduling works as expected.<br>We didn't find any specific comments about GPUs in the gang scheduling documentation. Are we trying something that is not supported, or are we doing it wrong? Also, is there a way to obtain more detailed logs or insight into how the scheduler actually decides whether to form a gang?<br><br>#!/bin/bash<br>#SBATCH --nodes=1<br>#SBATCH --gpus=2<br>#SBATCH --partition=main,interactive<br><br>IMAGES_DIR="/path/to/images"<br>IMAGE="nvcr.io/nvidia/cuda:10.0-base"<br><br><div>srun --container-image="$IMAGES_DIR/$IMAGE.sqsh" bash ...</div><div>----------------------------------------------------------------------------</div><div><br></div><div>Thanks for reading, and have a nice weekend</div><div>Tilman<br></div></div>