[slurm-users] Job scheduling bug?
Luke Sudbery
l.r.sudbery at bham.ac.uk
Wed May 10 11:09:55 UTC 2023
After a bit more investigation it seems it is only jobs that request GPUs which are not starting.
Other jobs start OK, but jobs requesting a GPU just sit in the Pending (Resources) state until the controller is restarted, even if no jobs are running on the node at all. This definitely doesn't seem right to me.
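For example, even a minimal GPU request along these lines (the resource sizes are just illustrative) sits pending against the idle node:

srun -p broadwell-gpum60-ondemand --gres=gpu:1 -n1 --mem=4G -t 10 hostname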
There are currently user jobs on the node, but if it frees up I can run some more tests to see whether jobs submitted after a controller restart start once and only once per GPU, or what else is going on.
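Something like the following, run straight after a controller restart, should show whether each of the node's 2 GPUs can only be allocated once (the wrapped sleep and the job sizes are just placeholders):

for i in 1 2 3 4; do
    sbatch -p broadwell-gpum60-ondemand --gres=gpu:1 -n1 --mem=4G -t 10 --wrap="sleep 300"
done
squeue -p broadwell-gpum60-ondemand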
Many thanks,
Luke
--
Luke Sudbery
Principal Engineer (HPC and Storage).
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road
Please note I don't work on Monday.
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Luke Sudbery
Sent: 09 May 2023 17:38
To: slurm-users at schedmd.com
Subject: [slurm-users] Job scheduling bug?
We recently upgraded from 20.11.9 to 22.05.8 and appear to have a problem with jobs not being scheduled on nodes with free resources since then.
It is particularly noticeable on one partition with only one GPU node in it. Jobs queuing for this node are currently the highest priority in the queue, and the node is idle, but the jobs do not start:
[sudberlr-admin at bb-er-slurm01 ~]$ squeue -p broadwell-gpum60-ondemand --format "%.18i %.9P %.2t %.10M %.6D %30R %Q"
JOBID PARTITION ST TIME NODES NODELIST(REASON) PRIORITY
66631657 broadwell PD 0:00 1 (Resources) 230
66609948 broadwell PD 0:00 1 (Resources) 203
[sudberlr-admin at bb-er-slurm01 ~]$ squeue --format "%Q %i" --sort -Q | head -4
PRIORITY JOBID
230 66631657
212 66622378
210 66322847
[sudberlr-admin at bb-er-slurm01 ~]$ scontrol show node bear-pg0212u17b
NodeName=bear-pg0212u17b Arch=x86_64 CoresPerSocket=10
CPUAlloc=0 CPUEfctv=20 CPUTot=20 CPULoad=0.01
AvailableFeatures=haswell
ActiveFeatures=haswell
Gres=gpu:m60:2(S:0-1)
NodeAddr=bear-pg0212u17b NodeHostName=bear-pg0212u17b Version=22.05.8
OS=Linux 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 30 15:51:32 UTC 2021
RealMemory=511000 AllocMem=0 FreeMem=501556 Sockets=2 Boards=1
MemSpecLimit=501
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=broadwell-gpum60-ondemand,system
BootTime=2023-04-25T08:24:10 SlurmdStartTime=2023-05-04T11:57:46
LastBusyTime=2023-05-09T13:27:07
CfgTRES=cpu=20,mem=511000M,billing=20,gres/gpu=2
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
[sudberlr-admin at bb-er-slurm01 ~]$
The resources it requests are easily met by the node:
[sudberlr-admin at bb-er-slurm01 ~]$ scontrol show job 66631657
JobId=66631657 JobName=sys/dashboard/sys/bc_uob_paraview
UserId=XXXX(633299) GroupId=users(100) MCS_label=N/A
Priority=230 Nice=0 Account=XXXX QOS=bbondemand
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
SubmitTime=2023-05-09T13:27:31 EligibleTime=2023-05-09T13:27:31
AccrueTime=2023-05-09T13:27:31
StartTime=Unknown EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-05-09T16:02:30 Scheduler=Main
Partition=broadwell-gpum60-ondemand,cascadelake-hdr-ondemand,cascadelake-hdr-ondemand2 AllocNode:Sid=localhost:1120095
ReqNodeList=(null) ExcNodeList=(null)
NodeList=
NumNodes=1-1 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=8,mem=32G,node=1,billing=8,gres/gpu=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/XXXXXXXXXXXXX
StdErr=/XXXXXXXXXXXXX/output.log
StdIn=/dev/null
StdOut=/XXXXXXXXXXXXX/output.log
Power=
TresPerNode=gres:gpu:1
[sudberlr-admin at bb-er-slurm01 ~]$
This looks like a bug to me because it was working fine before the upgrade, and a simple restart of the slurm controller will often allow the jobs to start without any other changes:
[sudberlr-admin at bb-er-slurm01 ~]$ squeue -p broadwell-gpum60-ondemand --format "%.18i %.9P %.2t %.10M %.6D %32R %Q"
JOBID PARTITION ST TIME NODES NODELIST(REASON) PRIORITY
66631657 broadwell PD 0:00 1 (Resources) 230
66609948 broadwell PD 0:00 1 (Resources) 203
[sudberlr-admin at bb-er-slurm01 ~]$ sudo systemctl restart slurmctld; sleep 30; squeue -p broadwell-gpum60-ondemand --format "%.18i %.9P %.2t %.10M %.6D %32R %Q"
Job for slurmctld.service canceled.
JOBID PARTITION ST TIME NODES NODELIST(REASON) PRIORITY
66631657 broadwell R 0:04 1 bear-pg0212u17b 230
66609948 broadwell R 0:04 1 bear-pg0212u17b 203
[sudberlr-admin at bb-er-slurm01 ~]$
Has anyone come across this behaviour or have any other ideas?
Many thanks,
Luke
--
Luke Sudbery
Principal Engineer (HPC and Storage).
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road
Please note I don't work on Monday.