I didn't mention that I have:

SchedulerType=sched/backfill

in slurm.conf

I am reading https://slurm.schedmd.com/sched_config.html where it is written:

If the job under consideration can start immediately without impacting the expected start time of any higher priority job, then it does so

but it also says something that I am not able to fully understand:

For performance reasons, the backfill scheduler reserves whole nodes for jobs, even if jobs don't require whole nodes

Does this mean that the worker nodes listed in the "squeue --start" output [*] are basically unusable for other jobs until those planned jobs start running?

This would explain my problem, but I don't understand the logic behind this behavior.
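
To make my reading of that sentence concrete, here is a toy model in Python. It is my own sketch, not Slurm's actual algorithm. It assumes that a node planned for a higher priority job is reserved as a whole from the planned start time, so a lower priority job can be backfilled there only if its time limit expires before that start. The function name can_backfill is made up; the timestamps are taken from the scontrol output of jobs 283534 and 283469 in my previous message, quoted below.

```python
from datetime import datetime, timedelta


def can_backfill(now, time_limit, planned_start):
    """Toy whole-node rule (an assumption, not Slurm code).

    The lower priority job may use the node only if it is guaranteed
    to finish before the planned job is expected to start.
    """
    return now + time_limit <= planned_start


# Time at which my job 283534 was evaluated by the scheduler.
now = datetime.fromisoformat("2026-04-13T11:58:39")
# Expected start of job 283469 on cld-ter-gpu-01.
planned_start = datetime.fromisoformat("2026-04-13T14:31:40")

# My job has TimeLimit=1-00:00:00, so it would end after the planned start.
one_day_ok = can_backfill(now, timedelta(days=1), planned_start)
# A hypothetical 2-hour job would end at 13:58:39, before the planned start.
two_hours_ok = can_backfill(now, timedelta(hours=2), planned_start)

print(one_day_ok)    # False
print(two_hours_ok)  # True
```

If this model is right, my 1-day job could never be backfilled on cld-ter-gpu-01, no matter how many CPUs and how much memory are free there.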

Thanks again

Massimo

[*]

[sgaravat@cld-ter-ui-01 ~]$ squeue --start
JOBID  PARTITION NAME     USER     ST START_TIME          NODES SCHEDNODES     NODELIST(REASON)
283565 gpus      vllm-pod ciangott PD 2026-04-14T14:22:38 1     cld-ter-gpu-01 (Resources)
284080 gpus,gpus qed_supe catalano PD 2026-04-14T21:26:22 1     cld-ter-gpu-05 (Resources)
284081 gpus,gpus qed_supe catalano PD 2026-04-14T22:32:57 1     cld-ter-gpu-04 (Priority)
284099 gpus,gpus qed_supe catalano PD 2026-04-15T07:19:36 1     cld-ter-gpu-03 (Priority)
284119 onlycpus- myscript sgaravat PD 2026-04-15T11:12:26 1     btc-dfa-gpu-02 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
284121 onlycpus- myscript sgaravat PD 2026-04-15T11:12:26 1     cld-dfa-gpu-06 (Priority)
284090 gpus,gpus long_isi pavesic  PD N/A                 1     (null)         (Priority)
284091 gpus,gpus long_isi pavesic  PD N/A                 1     (null)         (Priority)
284092 gpus,gpus long_isi pavesic  PD N/A                 1     (null)         (Priority)
284093 gpus,gpus long_isi pavesic  PD N/A                 1     (null)         (Priority)
284094 gpus,gpus long_isi pavesic  PD N/A                 1     (null)         (Priority)
284095 gpus,gpus long_isi pavesic  PD N/A                 1     (null)         (Priority)
284104 gpus,gpus qed_supe catalano PD N/A                 1     (null)         (Priority)

On Mon, Apr 13, 2026 at 1:02 PM Massimo Sgaravatto <massimo.sgaravatto@gmail.com> wrote:

Dear all

I (try to) manage a Slurm cluster composed of some CPU-only nodes and some worker nodes that also have GPUs:

NodeName=cld-ter-[01-06] Sockets=2 CoresPerSocket=96 ThreadsPerCore=2 RealMemory=1536000 State=UNKNOWN
NodeName=cld-ter-gpu-[01-05] Sockets=2 CoresPerSocket=96 ThreadsPerCore=2 Gres=gpu:nvidia-h100:4 RealMemory=1536000 State=UNKNOWN

The GPU nodes are exposed through multiple partitions:

PartitionName=gpus Nodes=cld-ter-gpu-[01-02] State=UP PriorityTier=20
PartitionName=sparch Nodes=cld-ter-gpu-03 AllowAccounts=sparch,operators QoS=sparch State=UP PriorityTier=20
PartitionName=geant4 Nodes=cld-ter-gpu-03 AllowAccounts=geant4,operators QoS=geant4 State=UP PriorityTier=20
PartitionName=enipred Nodes=cld-ter-gpu-04 AllowAccounts=enipred,operators QoS=enipred State=UP PriorityTier=20
PartitionName=enipiml Nodes=cld-ter-gpu-05 AllowAccounts=enipiml,operators QoS=enipiml State=UP PriorityTier=20

We also set up a partition that allows CPU-only jobs on the GPU nodes. These jobs should be preempted (killed and requeued) if jobs submitted to higher-priority partitions need those resources:

PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
PartitionName=onlycpus-opp Nodes=cld-ter-gpu-[01-05],cld-dfa-gpu-06,btc-dfa-gpu-02 State=UP PriorityTier=10

What I don't understand is why this job [*], submitted to the onlycpus-opp partition, can't start running on, e.g., cld-ter-gpu-01, which has plenty of free resources:

[sgaravat@cld-ter-ui-01 ~]$ scontrol show node cld-ter-gpu-01
NodeName=cld-ter-gpu-01 Arch=x86_64 CoresPerSocket=96
CPUAlloc=8 CPUEfctv=384 CPUTot=384 CPULoad=5.93
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:nvidia-h100:4
NodeAddr=cld-ter-gpu-01 NodeHostName=cld-ter-gpu-01 Version=25.11.3
OS=Linux 5.14.0-611.45.1.el9_7.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Apr 1 05:56:53 EDT 2026
RealMemory=1536000 AllocMem=560000 FreeMem=1192357 Sockets=2 Boards=1
State=MIXED+PLANNED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=gpus,onlycpus-opp
BootTime=2026-04-09T10:39:35 SlurmdStartTime=2026-04-09T10:40:01
LastBusyTime=2026-04-09T11:54:46 ResumeAfterTime=None
CfgTRES=cpu=384,mem=1500G,billing=839,gres/gpu=4,gres/gpu:nvidia-h100=4
AllocTRES=cpu=8,mem=560000M,gres/gpu=4,gres/gpu:nvidia-h100=4
CurrentWatts=0 AveWatts=0

I guess the "MIXED+PLANNED" state is the answer, but as far as I can see only one job (283469) is planned for this worker node:

[sgaravat@cld-ter-ui-01 ~]$ squeue --start | grep ter-gpu-01
JOBID  PARTITION NAME     USER     ST START_TIME          NODES SCHEDNODES     NODELIST(REASON)
283469 gpus      vllm-pod ciangott PD 2026-04-13T14:31:40 1     cld-ter-gpu-01 (Resources)

But job 283469 doesn't require many resources [**], so the two jobs could run together. Why can't job 283534 start?
Any hints?
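
As a sanity check on the claim that the two jobs fit together, here is the arithmetic in Python, using only the numbers from the scontrol outputs in this message. Memory is in MB, and I assume 100G = 102400M, which is consistent with RealMemory=1536000 being shown as mem=1500G. Only CPUs and memory are compared, since job 283534 requests no GPUs; the variable names are mine.

```python
# Node cld-ter-gpu-01, from "scontrol show node" (memory in MB).
node_total = {"cpu": 384, "mem": 1536000}
node_alloc = {"cpu": 8, "mem": 560000}

# Requests taken from ReqTRES of the two jobs.
job_283469 = {"cpu": 32, "mem": 190734}
job_283534 = {"cpu": 1, "mem": 100 * 1024}  # 100G, assuming 1G = 1024M

# What is free on the node right now.
free = {k: node_total[k] - node_alloc[k] for k in node_total}
# What the two jobs would need if they ran side by side.
needed = {k: job_283469[k] + job_283534[k] for k in node_total}
# True only if every requested resource fits into the free amount.
fits = all(needed[k] <= free[k] for k in node_total)

print(free)    # {'cpu': 376, 'mem': 976000}
print(needed)  # {'cpu': 33, 'mem': 293134}
print(fits)    # True
```

So in terms of CPUs and memory both jobs would fit with a wide margin. Note that all 4 GPUs of the node are currently allocated (AllocTRES gres/gpu=4), which is presumably why job 283469, which asks for 2 GPUs, is pending with Reason=Resources.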

Thanks, Massimo

[*]

[sgaravat@cld-ter-ui-01 ~]$ scontrol show job=283534
JobId=283534 JobName=myscript.sh
UserId=sgaravat(5008) GroupId=tbadmin(5001) MCS_label=N/A
Priority=542954 Nice=0 Account=operators QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:41 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2026-04-13T11:10:13 EligibleTime=2026-04-13T11:10:13
AccrueTime=2026-04-13T11:10:13
StartTime=2026-04-13T11:58:39 EndTime=2026-04-14T11:58:39 Deadline=N/A
PreemptEligibleTime=2026-04-13T11:58:39 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-04-13T11:58:39 Scheduler=Backfill
Partition=onlycpus-opp AllocNode:Sid=cld-ter-ui-01:3035857
ReqNodeList=(null) ExcNodeList=(null)
NodeList=btc-dfa-gpu-02
BatchHost=btc-dfa-gpu-02
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=100G,node=1,billing=26
AllocTRES=cpu=1,mem=100G,node=1,billing=26
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=100G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)
Command=/shared/home/sgaravat/myscript.sh
SubmitLine=sbatch myscript.sh
WorkDir=/shared/home/sgaravat
StdErr=/shared/home/sgaravat/JOB-myscript.sh.283534.4294967294.err
StdIn=/dev/null
StdOut=/shared/home/sgaravat/JOB-myscript.sh.283534.4294967294.out
MailUser=massimo.sgaravatto@pd.infn.it MailType=INVALID_DEPEND,BEGIN,END,FAIL,REQUEUE,STAGE_OUT

[**]
[sgaravat@cld-ter-ui-01 ~]$ scontrol show job=283469
JobId=283469 JobName=vllm-pod
UserId=ciangott(6054) GroupId=tbuser(6000) MCS_label=N/A
Priority=499703 Nice=0 Account=cms QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2026-04-13T06:48:37 EligibleTime=2026-04-13T06:48:37
AccrueTime=2026-04-13T06:48:37
StartTime=2026-04-13T14:31:40 EndTime=2026-04-14T14:31:40 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-04-13T11:59:48 Scheduler=Main
Partition=gpus AllocNode:Sid=cld-ter-ui-01:3015801
ReqNodeList=(null) ExcNodeList=(null)
NodeList= SchedNodeList=cld-ter-gpu-01
NumNodes=1-1 NumCPUs=32 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=32,mem=190734M,node=1,billing=118,gres/gpu=2,gres/gpu:nvidia-h100=2
AllocTRES=(null)
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=32 MinMemoryNode=190734M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)
Command=.interlink/jobs/default-0c0257f8-d1ea-4135-a602-96c229ce8516/job.slurm
SubmitLine=sbatch .interlink/jobs/default-0c0257f8-d1ea-4135-a602-96c229ce8516/job.slurm
WorkDir=/shared/home/ciangott
StdErr=
StdIn=/dev/null
StdOut=/shared/home/ciangott/.interlink/jobs/default-0c0257f8-d1ea-4135-a602-96c229ce8516/job.out
TresPerNode=gres/gpu:nvidia-h100:2
TresPerTask=cpu=32