[slurm-users] Jobs waiting while plenty of cpu and memory available

Edward Ned Harvey (slurm) slurm at nedharvey.com
Tue Jul 9 14:07:42 UTC 2019


> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of
> Ole Holm Nielsen
> Sent: Tuesday, July 9, 2019 2:36 AM
> 
> When some jobs are pending with Reason=Priority this means that other
> jobs with a higher priority are waiting for the same resources (CPUs) to
> become available, and they will have Pending=Resources in the squeue
> output.

Yeah, that's exactly the problem. There are plenty of CPU and memory resources available, yet jobs are waiting. Is there any way to see specifically which resources the jobs are waiting for, or which jobs are ahead of a particular job in the queue, so I can check what resources the job at the front requires? "scontrol show partition" doesn't reveal any clear problems:

    PartitionName=batch
       AllowGroups=ALL AllowAccounts=ALL DenyQos=foo,bar,baz
       AllocNodes=ALL Default=YES QoS=N/A
       DefaultTime=00:15:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
       MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
       Nodes=alpha[003-068],omega[003-068]
       PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
       OverTimeLimit=NONE PreemptMode=REQUEUE
       State=UP TotalCPUs=4321 TotalNodes=123 SelectTypeParameters=NONE
       DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

The QoS policies are not new and have not changed recently, yet jobs pending like this is a new problem. I can't seem to get any information about why they're pending.
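(For anyone hitting similar symptoms: commands along these lines should expose per-job pending reasons and the priority ordering, assuming standard squeue/sprio format options, though in my case they haven't made the cause obvious yet:)

    # Pending jobs with their reason, priority, and any node-related info
    squeue --state=PENDING --format="%.12i %.9P %.8u %.2t %.12r %.10p %R"

    # Estimated start times (and reasons) the backfill scheduler has computed
    squeue --start

    # Breakdown of each pending job's priority factors (age, fairshare, QOS, ...)
    sprio -l

    # Full detail for one job, including its resource request (CPUs, memory, nodes)
    scontrol show job <jobid>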
