At the moment we have two nodes with long job wait times. Normally that happens when the nodes are fully allocated, but what other reasons could make a job wait this long when there still appears to be plenty of CPU and memory available? We are running Slurm 23.02.4 via Bright Computing. The compute nodes have hyperthreading enabled, but that should be irrelevant here. Is there a way to determine what else could be holding jobs up?
As an example, this interactive request just sits in the queue even though node001 looks far from full:
srun --pty -t 0-01:00:00 --nodelist=node001 --gres=gpu:1 -A ourts -p short /bin/bash
srun: job 672204 queued and waiting for resources
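I assume the scheduler's stated reason is the first place to look; something like this (using the job ID above) should print it, though I am not sure it tells the whole story:
squeue -j 672204 -O JobID,State,Reason
scontrol show job 672204 | grep -E 'JobState|Reason|TRES'
Meanwhile the node itself looks like this while the job is pending: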
scontrol show node node001
NodeName=node001 Arch=x86_64 CoresPerSocket=48
CPUAlloc=24 CPUEfctv=192 CPUTot=192 CPULoad=20.37
AvailableFeatures=location=local
ActiveFeatures=location=local
Gres=gpu:A6000:8
NodeAddr=node001 NodeHostName=node001 Version=23.02.4
OS=Linux 5.14.0-70.13.1.el9_0.x86_64 #1 SMP PREEMPT Thu Apr 14 12:42:38 EDT 2022
RealMemory=1031883 AllocMem=1028096 FreeMem=222528 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=ours,short
BootTime=2024-04-29T16:18:30 SlurmdStartTime=2024-05-18T16:48:11
LastBusyTime=2024-06-03T10:49:49 ResumeAfterTime=None
CfgTRES=cpu=192,mem=1031883M,billing=192,gres/gpu=8
AllocTRES=cpu=24,mem=1004G,gres/gpu=2,gres/gpu:a6000=2
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
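In case it is relevant, I assume the same CPU/memory/GRES accounting can be pulled per node in one line with sinfo (the field names below are from --Format and are my guess at the useful ones):
sinfo -N -n node001 -O NodeList,StateCompact,CPUsState,Memory,AllocMem,FreeMem,GresUsed
Grepping the controller log for the job ID only turns up the allocate RPC: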
grep 672204 /var/log/slurmctld
[2024-06-04T15:50:35.627] sched: _slurm_rpc_allocate_resources JobId=672204 NodeList=(null) usec=852
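I assume more scheduling detail could be coaxed out by raising the controller's debug flags, and that sprio / squeue --start would show whether priority or backfill planning is the hold-up. These are my guesses at the relevant commands (the -backfill/-selecttype forms turn the flags back off afterwards):
scontrol setdebugflags +backfill
scontrol setdebugflags +selecttype
sprio -l -j 672204
squeue --start -j 672204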