Dear all,

I (try to) manage a Slurm cluster composed of some CPU-only nodes and some worker nodes that also have GPUs:

NodeName=cld-ter-[01-06] Sockets=2 CoresPerSocket=96 ThreadsPerCore=2 RealMemory=1536000 State=UNKNOWN
NodeName=cld-ter-gpu-[01-05] Sockets=2 CoresPerSocket=96 ThreadsPerCore=2 Gres=gpu:nvidia-h100:4 RealMemory=1536000 State=UNKNOWN

The GPU nodes are exposed through multiple partitions:

PartitionName=gpus Nodes=cld-ter-gpu-[01-02] State=UP PriorityTier=20
PartitionName=sparch Nodes=cld-ter-gpu-03 AllowAccounts=sparch,operators QoS=sparch State=UP PriorityTier=20
PartitionName=geant4 Nodes=cld-ter-gpu-03 AllowAccounts=geant4,operators QoS=geant4 State=UP PriorityTier=20
PartitionName=enipred Nodes=cld-ter-gpu-04 AllowAccounts=enipred,operators QoS=enipred State=UP PriorityTier=20
PartitionName=enipiml Nodes=cld-ter-gpu-05 AllowAccounts=enipiml,operators QoS=enipiml State=UP PriorityTier=20

We also set up a partition to allow CPU-only jobs on the GPU nodes, but these jobs should be preempted (killed and requeued) if jobs submitted to higher-priority partitions require those resources:

PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
PartitionName=onlycpus-opp Nodes=cld-ter-gpu-[01-05],cld-dfa-gpu-06,btc-dfa-gpu-02 State=UP PriorityTier=10

Now, I don't understand why this job [*], submitted to the onlycpus-opp partition, can't start running e.g. on cld-ter-gpu-01, since that node has a lot of free resources:

[sgaravat@cld-ter-ui-01 ~]$ scontrol show node cld-ter-gpu-01
NodeName=cld-ter-gpu-01 Arch=x86_64 CoresPerSocket=96
   CPUAlloc=8 CPUEfctv=384 CPUTot=384 CPULoad=5.93
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:nvidia-h100:4
   NodeAddr=cld-ter-gpu-01 NodeHostName=cld-ter-gpu-01 Version=25.11.3
   OS=Linux 5.14.0-611.45.1.el9_7.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Apr 1 05:56:53 EDT 2026
   RealMemory=1536000 AllocMem=560000 FreeMem=1192357 Sockets=2 Boards=1
   State=MIXED+PLANNED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpus,onlycpus-opp
   BootTime=2026-04-09T10:39:35 SlurmdStartTime=2026-04-09T10:40:01
   LastBusyTime=2026-04-09T11:54:46 ResumeAfterTime=None
   CfgTRES=cpu=384,mem=1500G,billing=839,gres/gpu=4,gres/gpu:nvidia-h100=4
   AllocTRES=cpu=8,mem=560000M,gres/gpu=4,gres/gpu:nvidia-h100=4
   CurrentWatts=0 AveWatts=0

I guess the "MIXED+PLANNED" state is the answer, but as far as I can see only one job (283469) is planned for this worker node:

[sgaravat@cld-ter-ui-01 ~]$ squeue --start | grep ter-gpu-01
 JOBID PARTITION     NAME     USER ST          START_TIME NODES SCHEDNODES     NODELIST(REASON)
283469      gpus vllm-pod ciangott PD 2026-04-13T14:31:40     1 cld-ter-gpu-01 (Resources)

But job 283469 doesn't require too many resources [**], so the two jobs could run together. Why can't job 283534 start? Any hints?

Thanks, Massimo

[*]
[sgaravat@cld-ter-ui-01 ~]$ scontrol show job=283534
JobId=283534 JobName=myscript.sh
   UserId=sgaravat(5008) GroupId=tbadmin(5001) MCS_label=N/A
   Priority=542954 Nice=0 Account=operators QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:41 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2026-04-13T11:10:13 EligibleTime=2026-04-13T11:10:13
   AccrueTime=2026-04-13T11:10:13
   StartTime=2026-04-13T11:58:39 EndTime=2026-04-14T11:58:39 Deadline=N/A
   PreemptEligibleTime=2026-04-13T11:58:39 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-04-13T11:58:39 Scheduler=Backfill
   Partition=onlycpus-opp AllocNode:Sid=cld-ter-ui-01:3035857
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=btc-dfa-gpu-02
   BatchHost=btc-dfa-gpu-02
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=100G,node=1,billing=26
   AllocTRES=cpu=1,mem=100G,node=1,billing=26
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=100G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)
   Command=/shared/home/sgaravat/myscript.sh
   SubmitLine=sbatch myscript.sh
   WorkDir=/shared/home/sgaravat
   StdErr=/shared/home/sgaravat/JOB-myscript.sh.283534.4294967294.err
   StdIn=/dev/null
   StdOut=/shared/home/sgaravat/JOB-myscript.sh.283534.4294967294.out
   MailUser=massimo.sgaravatto@pd.infn.it
   MailType=INVALID_DEPEND,BEGIN,END,FAIL,REQUEUE,STAGE_OUT

[**]
[sgaravat@cld-ter-ui-01 ~]$ scontrol show job=283469
JobId=283469 JobName=vllm-pod
   UserId=ciangott(6054) GroupId=tbuser(6000) MCS_label=N/A
   Priority=499703 Nice=0 Account=cms QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2026-04-13T06:48:37 EligibleTime=2026-04-13T06:48:37
   AccrueTime=2026-04-13T06:48:37
   StartTime=2026-04-13T14:31:40 EndTime=2026-04-14T14:31:40 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-04-13T11:59:48 Scheduler=Main
   Partition=gpus AllocNode:Sid=cld-ter-ui-01:3015801
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList= SchedNodeList=cld-ter-gpu-01
   NumNodes=1-1 NumCPUs=32 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=32,mem=190734M,node=1,billing=118,gres/gpu=2,gres/gpu:nvidia-h100=2
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryNode=190734M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)
   Command=.interlink/jobs/default-0c0257f8-d1ea-4135-a602-96c229ce8516/job.slurm
   SubmitLine=sbatch .interlink/jobs/default-0c0257f8-d1ea-4135-a602-96c229ce8516/job.slurm
   WorkDir=/shared/home/ciangott
   StdErr=
   StdIn=/dev/null
   StdOut=/shared/home/ciangott/.interlink/jobs/default-0c0257f8-d1ea-4135-a602-96c229ce8516/job.out
   TresPerNode=gres/gpu:nvidia-h100:2
   TresPerTask=cpu=32
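As a sanity check, the headroom on cld-ter-gpu-01 can be worked out from the CfgTRES/AllocTRES values in the scontrol output above. The sketch below is plain shell arithmetic on numbers hard-coded from this post (not a live query), and it suggests that both jobs would indeed fit on the node at once:

```shell
#!/bin/sh
# Node totals and allocations copied from "scontrol show node cld-ter-gpu-01" above.
cfg_cpu=384; alloc_cpu=8           # CPUTot / CPUAlloc
cfg_mem=1536000; alloc_mem=560000  # RealMemory / AllocMem (MB)

free_cpu=$((cfg_cpu - alloc_cpu))
free_mem=$((cfg_mem - alloc_mem))

# Requests of the two jobs discussed in this thread:
#   283469 (gpus):          32 CPUs, 190734 MB
#   283534 (onlycpus-opp):   1 CPU,  100 GB = 102400 MB
need_cpu=$((32 + 1))
need_mem=$((190734 + 102400))

echo "free: ${free_cpu} CPUs, ${free_mem} MB"
if [ "$free_cpu" -ge "$need_cpu" ] && [ "$free_mem" -ge "$need_mem" ]; then
  echo "both jobs fit on the node simultaneously"
fi
```

(GPUs are not counted here: job 283534 requests none, so only CPUs and memory matter for it.)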
IIRC, you cannot have jobs from two partitions running concurrently on the same node; the requested resources are irrelevant. It seems a node can only be in a single partition at a time.

Diego

On 13/04/26 13:02, Massimo Sgaravatto via slurm-users wrote:
-- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786
Hi Diego,

I believe that a node may run jobs from multiple partitions at the same time. An example of a node in our cluster:

$ sinfo -n sd652
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
(lines deleted)
a100_week    up 7-00:00:00     1  alloc sd652
a100         up 2-02:00:00     1  alloc sd652

I believe this was always the case (we're running Slurm 25.11.4).

Best regards,
Ole

On 4/13/26 13:54, Diego Zuccato via slurm-users wrote:
Hi,

What do you mean that you cannot have jobs from two partitions running concurrently on the same node? E.g. right now the node btc-dfa-gpu-02 is running jobs from both the qst and the onlycpus-opp partitions:

[sgaravat@cld-ter-ui-01 ~]$ squeue | grep btc-dfa
283558 onlycpus- myscript sgaravat R       0:10 1 btc-dfa-gpu-02
283559 onlycpus- myscript sgaravat R       0:10 1 btc-dfa-gpu-02
283560 onlycpus- myscript sgaravat R       0:10 1 btc-dfa-gpu-02
283561 onlycpus- myscript sgaravat R       0:10 1 btc-dfa-gpu-02
283562 onlycpus- myscript sgaravat R       0:10 1 btc-dfa-gpu-02
283563 onlycpus- myscript sgaravat R       0:10 1 btc-dfa-gpu-02
283382       qst morun_ci   barone R 1-23:37:36 1 btc-dfa-gpu-02
283383       qst morun_ci   barone R 1-23:37:36 1 btc-dfa-gpu-02
283388       qst morun_mv   barone R 1-23:37:36 1 btc-dfa-gpu-02
283381       qst morun_ci   barone R 1-23:37:37 1 btc-dfa-gpu-02

Cheers, Massimo

On Mon, Apr 13, 2026 at 2:18 PM Diego Zuccato via slurm-users <slurm-users@lists.schedmd.com> wrote:
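A quick way to check this on a live cluster is `squeue -w <node> -h -o '%P' | sort -u`, which prints the distinct partitions with jobs on that node. The sketch below only demonstrates the counting step, using partition names copied from the squeue output above in place of the live command (which of course needs a running cluster):

```shell
#!/bin/sh
# Stand-in for: squeue -w btc-dfa-gpu-02 -h -o '%P'
# (partition names copied from the squeue output above)
sample="onlycpus-opp
onlycpus-opp
onlycpus-opp
onlycpus-opp
onlycpus-opp
onlycpus-opp
qst
qst
qst
qst"
count=$(printf '%s\n' "$sample" | sort -u | wc -l)
echo "distinct partitions with running jobs on btc-dfa-gpu-02: $count"
```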
-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-leave@lists.schedmd.com
Good to know. When I tested it (more than 10 years ago...) I couldn't make it work, and the users got quite upset. So we changed to using partitions just to group homogeneous nodes, while QoSes provide the limits and priorities. If that's not the issue, I have no idea what else it could be, sorry.

Diego

On 13/04/26 14:33, Massimo Sgaravatto wrote:
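For anyone wanting to try that layout, a QoS-based equivalent of the partition-priority preemption setup above would look roughly like the fragment below. This is an illustrative sketch, not tested configuration: the QoS names are made up, and the exact attributes should be checked against the sacctmgr and slurm.conf man pages before use.

```
# slurm.conf fragment (QoS-based preemption instead of partition_prio):
PreemptType=preempt/qos
PreemptMode=REQUEUE

# QoS setup via sacctmgr (illustrative names):
sacctmgr add qos opportunistic
sacctmgr modify qos opportunistic set priority=10 preemptmode=requeue
sacctmgr add qos production
sacctmgr modify qos production set priority=20 preempt=opportunistic
```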
Hi
What do you mean that you can not have jobs from two partitions running concurrently on the same node ? E.g. right now the node btc-dfa-gpu-02 is running jobs from the qst and the onlycpus-opp partitions:
sgaravat@cld-ter-ui-01 ~]$ squeue | grep btc-dfa 283558 onlycpus- myscript sgaravat R 0:10 1 btc-dfa-gpu-02 283559 onlycpus- myscript sgaravat R 0:10 1 btc-dfa-gpu-02 283560 onlycpus- myscript sgaravat R 0:10 1 btc-dfa-gpu-02 283561 onlycpus- myscript sgaravat R 0:10 1 btc-dfa-gpu-02 283562 onlycpus- myscript sgaravat R 0:10 1 btc-dfa-gpu-02 283563 onlycpus- myscript sgaravat R 0:10 1 btc-dfa-gpu-02 283382 qst morun_ci barone R 1-23:37:36 1 btc-dfa-gpu-02 283383 qst morun_ci barone R 1-23:37:36 1 btc-dfa-gpu-02 283388 qst morun_mv barone R 1-23:37:36 1 btc-dfa-gpu-02 283381 qst morun_ci barone R 1-23:37:37 1 btc-dfa-gpu-02
Cheers, Massimo
On Mon, Apr 13, 2026 at 2:18 PM Diego Zuccato via slurm-users <slurm- users@lists.schedmd.com <mailto:slurm-users@lists.schedmd.com>> wrote:
IIRC, you can not have jobs from two partitions running concurrently on the same node, the requested resources are irrelevant. Seems a node can only be in a single partition at a time.
Diego
Il 13/04/26 13:02, Massimo Sgaravatto via slurm-users ha scritto: > Dear all > > I (try to) manage a slurm cluster composed by some CPU-only nodes and > some worker nodes which have also GPUs: > > NodeName=cld-ter-[01-06] Sockets=2 CoresPerSocket=96 ThreadsPerCore=2 > RealMemory=1536000 State=UNKNOWN > NodeName=cld-ter-gpu-[01-05] Sockets=2 CoresPerSocket=96 > ThreadsPerCore=2 Gres=gpu:nvidia-h100:4 RealMemory=1536000 State=UNKNOWN > > The GPU nodes are exposed through multiple partitions: > > > PartitionName=gpus Nodes=cld-ter-gpu-[01-02] State=UP PriorityTier=20 > PartitionName=sparch Nodes=cld-ter-gpu-03 AllowAccounts=sparch,operators > QoS=sparch State=UP PriorityTier=20 > PartitionName=geant4 Nodes=cld-ter-gpu-03 AllowAccounts=geant4,operators > QoS=geant4 State=UP PriorityTier=20 > PartitionName=enipred Nodes=cld-ter-gpu-04 > AllowAccounts=enipred,operators QoS=enipred State=UP PriorityTier=20 > PartitionName=enipiml Nodes=cld-ter-gpu-05 > AllowAccounts=enipiml,operators QoS=enipiml State=UP PriorityTier=20 > > > > We also set a partition to allow cpu-only jobs on the GPU nodes, but > these jobs should be preempted (killed and requeued) if jobs submitted > to partitions with higher priorities require those resources: > > > > PreemptType=preempt/partition_prio > PreemptMode=REQUEUE > PartitionName=onlycpus-opp Nodes=cld-ter-gpu-[01-05],cld-dfa- gpu-06,btc- > dfa-gpu-02 State=UP PriorityTier=10 > > Now, I don't understand why this job [*] submitted on the onlycpus-opp > partition can't start running e.g. 
-- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com
On 4/13/26 4:54 am, Diego Zuccato via slurm-users wrote:
Seems a node can only be in a single partition at a time.
That's not true in my experience; we run our systems that way, with many overlapping partitions (every node is in at least three), and that has not caused problems for us.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Philadelphia, PA, USA
Let me add that if I modify the time limit of the job so that it can finish before 2026-04-13T14:31:40 (i.e. the time when job 283469 is supposed to start):

scontrol update JobId=283534 TimeLimit=10:00:00

then the job starts running on the worker node cld-ter-gpu-01.

Any hint to understand the issue is really appreciated :-)

Thanks, Massimo
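Massimo's observation above (the job starts as soon as its time limit fits before the planned start of job 283469) is consistent with a conservative backfill check. The sketch below is a toy Python model, not Slurm's actual implementation; the function name `can_backfill` and the timestamps are illustrative only:

```python
from datetime import datetime, timedelta

def can_backfill(now, time_limit, reserved_start):
    # Toy model of a conservative backfill check: a pending job may
    # backfill onto a node holding a reservation for a higher-priority
    # job only if its time limit guarantees it ends before that job's
    # planned StartTime.
    return now + time_limit <= reserved_start

now = datetime(2026, 4, 13, 12, 0)
# Planned start of job 283469, from "squeue --start":
reserved_start = datetime(2026, 4, 13, 14, 31, 40)

# With the original 1-00:00:00 limit the job overlaps the reservation
# and is held back; with a short enough limit it fits in the window.
blocked = can_backfill(now, timedelta(days=1), reserved_start)
fits = can_backfill(now, timedelta(hours=2), reserved_start)
```

Under this model, preemption does not enter the picture for the already-scheduled higher-priority job: the backfill scheduler simply refuses to place anything that would push back its planned start.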
If you set that time limit as you said, you must be backfilling. As such, I speculate (without having read all the details you wrote, sorry, I'm in a hurry) that the other job that is starting is "larger" than you think, not leaving enough resources for your job to start earlier.
I could be wrong again ( :) ), but I suspect Slurm won't start a job it already knows will be preempted: preemption is considered only at scheduling time for the higher-priority job (i.e. "let's see whether this new job can preempt some others to start sooner").

Diego
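Diego's point about preempt/partition_prio can be pictured with a toy model: with that plugin, a job's eligibility to be requeued depends only on its partition's PriorityTier relative to the incoming job's. The sketch is illustrative; the function and data shapes are invented for the example, and real Slurm applies many more checks:

```python
def preemption_candidates(running_jobs, new_job_tier):
    # Toy model of PreemptType=preempt/partition_prio with
    # PreemptMode=REQUEUE: only jobs whose partition PriorityTier is
    # strictly lower than the incoming job's tier may be requeued.
    return [j for j in running_jobs if j["tier"] < new_job_tier]

running = [
    {"id": 283534, "partition": "onlycpus-opp", "tier": 10},
    {"id": 283000, "partition": "gpus", "tier": 20},
]

# A new job in the gpus partition (tier 20) could requeue the tier-10
# opportunistic job, but not another tier-20 job:
victims = [j["id"] for j in preemption_candidates(running, 20)]
```

Note this only governs what a *new higher-priority job* may displace; as Diego says, it does not make the scheduler willing to start a low-tier job whose time limit collides with an already-planned high-tier job.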
On 4/13/26 4:02 am, Massimo Sgaravatto via slurm-users wrote:
CfgTRES=cpu=384,mem=1500G,billing=839,gres/gpu=4,gres/gpu:nvidia-h100=4 AllocTRES=cpu=8,mem=560000M,gres/gpu=4,gres/gpu:nvidia-h100=4
For some reason whatever jobs are running on that node are consuming all 4 GPUs - now the job you mention isn't asking for them:
ReqTRES=cpu=1,mem=100G,node=1,billing=26 AllocTRES=cpu=1,mem=100G,node=1,billing=26
So is it possible there's another job on there too? What does "squeue -w cld-ter-gpu-01" say? Also what does "scontrol show part onlycpus-opp" say? All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Philadelphia, PA, USA
I didn't mention that I have:

SchedulerType=sched/backfill

in slurm.conf.

I am reading https://slurm.schedmd.com/sched_config.html where it is written:

"If the job under consideration can start immediately without impacting the expected start time of any higher priority job, then it does so"

but it is also written something that I am not able to fully understand:

"For performance reasons, the backfill scheduler reserves whole nodes for jobs, even if jobs don't require whole nodes"

Does this mean that the worker nodes listed in the "squeue --start" output [*] are basically not usable until those jobs start running? This would explain my problem, but I don't understand the logic of this behavior.

Thanks again, Massimo

[*]
[sgaravat@cld-ter-ui-01 ~]$ squeue --start
JOBID PARTITION NAME USER ST START_TIME NODES SCHEDNODES NODELIST(REASON)
283565 gpus vllm-pod ciangott PD 2026-04-14T14:22:38 1 cld-ter-gpu-01 (Resources)
284080 gpus,gpus qed_supe catalano PD 2026-04-14T21:26:22 1 cld-ter-gpu-05 (Resources)
284081 gpus,gpus qed_supe catalano PD 2026-04-14T22:32:57 1 cld-ter-gpu-04 (Priority)
284099 gpus,gpus qed_supe catalano PD 2026-04-15T07:19:36 1 cld-ter-gpu-03 (Priority)
284119 onlycpus- myscript sgaravat PD 2026-04-15T11:12:26 1 btc-dfa-gpu-02 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
284121 onlycpus- myscript sgaravat PD 2026-04-15T11:12:26 1 cld-dfa-gpu-06 (Priority)
284090 gpus,gpus long_isi pavesic PD N/A 1 (null) (Priority)
284091 gpus,gpus long_isi pavesic PD N/A 1 (null) (Priority)
284092 gpus,gpus long_isi pavesic PD N/A 1 (null) (Priority)
284093 gpus,gpus long_isi pavesic PD N/A 1 (null) (Priority)
284094 gpus,gpus long_isi pavesic PD N/A 1 (null) (Priority)
284095 gpus,gpus long_isi pavesic PD N/A 1 (null) (Priority)
284104 gpus,gpus qed_supe catalano PD N/A 1 (null) (Priority)
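The sched_config note Massimo quotes (whole-node reservations) would explain the behavior: even though job 283469 only asks for 32 CPUs and 2 GPUs, the backfill scheduler holds the entire node for it from its planned start time onward. A toy Python sketch of the difference, purely illustrative (the function name and the whole_node_reserve flag are invented for the example; Slurm does not expose such a switch directly):

```python
def cpus_free_for_backfill(node_cpus, reserved_job_cpus, whole_node_reserve):
    # Toy model of the sched_config note: "For performance reasons, the
    # backfill scheduler reserves whole nodes for jobs, even if jobs
    # don't require whole nodes." With whole-node reservations, a
    # planned job that needs only a few CPUs still blocks backfill onto
    # the entire node past its planned start time.
    if whole_node_reserve:
        return 0                          # the whole node is held
    return node_cpus - reserved_job_cpus  # only the requested cores held

# cld-ter-gpu-01 has 384 CPUs; job 283469 plans to use 32 of them:
held_whole = cpus_free_for_backfill(384, 32, whole_node_reserve=True)
held_cores = cpus_free_for_backfill(384, 32, whole_node_reserve=False)
```

This matches the earlier observation: a 1-CPU job whose time limit extends past 14:31:40 cannot coexist with the whole-node reservation, while shortening its time limit sidesteps the reservation entirely.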
participants (5):
- Christopher Samuel
- Davide DelVento
- Diego Zuccato
- Massimo Sgaravatto
- Ole Holm Nielsen