[slurm-users] Jobs are pending when plenty of resources are available
Renfro, Michael
Renfro at tntech.edu
Mon Mar 30 13:47:08 UTC 2020
All of this is subject to scheduler configuration, but: what has job 409978 requested in terms of resources and time? It looks like it's the highest-priority pending job in the interactive partition, and I'd expect the interactive partition to have a higher priority than the regress partition.
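Something along these lines should show what it asked for and how the two jobs' priorities compare (field names vary a bit between Slurm versions, and sprio assumes the multifactor priority plugin):

  scontrol show job 409978 | egrep 'TimeLimit|NumNodes|NumCPUs|TRES|Partition'
  sprio -l -j 409978,409999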
As for job 409999, it’s requesting 8 cores and 32 GB of RAM for an infinite amount of time, not 1 core and 1 GB of RAM.
*If* job 409978 has requested a large amount of time on the entire cluster, *and* you don't have backfill running, I could see this situation happening.
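You can check the scheduler and backfill settings with something like the following (exact parameters depend on your slurm.conf):

  scontrol show config | egrep -i 'SchedulerType|SchedulerParameters|PriorityType'
  sdiag | grep -i -A 6 backfill

If SchedulerType is sched/builtin rather than sched/backfill, jobs queued behind 409978 in that partition generally won't start until it does, no matter how many nodes are idle.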
--
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University
> On Mar 29, 2020, at 10:17 PM, Carter, Allan <cartalla at amazon.com> wrote:
>
>
> I’m perplexed. My cluster has been churning along and tonight it has decided to start pending jobs even though there are plenty of nodes available.
>
> An example job from squeue:
>
> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
> 409978 interacti verdi amirinen PD 0:00 1 (Resources)
> 409989 regress update_r jenkins PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
> 409985 regress update_r amirinen PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
> 409982 regress update_r akshabal PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
> 409994 regress SYN__tpb kumarbck PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
> 409999 interacti sbatch_w akshabal PD 0:00 1 (Priority)
> 410000 regress ICC2__tp gadikon PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
> 410005 regress update_r amirinen PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
> 410003 regress update_r bachchuk PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
> 410006 regress update_r saurahuj PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
> 410009 regress xterm_fi gadikon PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
> 410010 regress ICC2__tp gadikon PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
> 410001 regress ICC2__tp gadikon PD 0:00 1 (Dependency)
> 410002 regress ICC2__tp gadikon PD 0:00 1 (Dependency)
> 410004 regress ICC2__tp gadikon PD 0:00 1 (Dependency)
> 410011 regress ICC2__tp gadikon PD 0:00 1 (Dependency)
> 410014 regress ICC2__tp gadikon PD 0:00 1 (Dependency)
> 410015 regress ICC2__tp gadikon PD 0:00 1 (Dependency)
> 409937 interacti verdi nsamra R 5:51:10 1 c7-c5n-18xl-3
>
> The output of sinfo shows plenty of nodes available for the scheduler.
>
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> all up infinite 31954 idle~ al2-t3-2xl-[0-999],al2-t3-l-[0-999],c7-c5-24xl-[0-5,7-10,14,16-17,19,21-46,48-151,153-155,157-164,167,169-485,487-999],c7-c5d-24xl-[0,2-999],c7-c5n-18xl-[0-2,4-14,16-26,28-44,46,48-51,53-54,56-63,65-67,69-72,74-75,77-82,84,86-99,101-999],c7-m5-24xl-[0-325,327-999],c7-m5d-24xl-[0-191,193-999],c7-m5dn-24xl-[0-3,5-97,99-999],c7-m5n-24xl-[0-24,26-999],c7-r5d-16xl-[0-3,5-999],c7-r5d-24xl-[1-16,18-999],c7-r5dn-24xl-[0-1,3-999],c7-t3-2xl-[0-8,10-970,973-999],c7-t3-l-[0-999],c7-x1-32xl-[0-6,8-999],c7-x1e-32xl-[0-999],c7-z1d-12xl-[0,2-5,7,9-10,12-999],rh7-c5-24xl-[0-999],rh7-c5d-24xl-[0-999],rh7-c5n-18xl-[0-999],rh7-m5-24xl-[0-999],rh7-m5d-24xl-[0-999],rh7-m5dn-24xl-[0-999],rh7-m5n-24xl-[0-999],rh7-r5d-16xl-[0-999],rh7-r5d-24xl-[0-999],rh7-r5dn-24xl-[0-999],rh7-t3-2xl-[0-999],rh7-t3-l-[0-999],rh7-x1-32xl-[0-999],rh7-x1e-32xl-[0-999],rh7-z1d-12xl-[0-999]
> all up infinite 2 drain c7-t3-l-s-0,rh7-t3-l-s-0
> all up infinite 46 mix c7-c5-24xl-[6,11-13,15,18,20,47,152,156,165-166,168,486],c7-c5d-24xl-1,c7-c5n-18xl-[3,15,27,45,47,52,55,64,68,73,76,83,85,100],c7-m5-24xl-326,c7-m5d-24xl-192,c7-m5dn-24xl-[4,98],c7-m5n-24xl-25,c7-r5d-16xl-4,c7-r5d-24xl-[0,17],c7-r5dn-24xl-2,c7-t3-2xl-[9,971-972],c7-x1-32xl-7,c7-z1d-12xl-[1,6,8,11]
> all up infinite 1 idle al2-t3-l-s-0
>
> The job isn’t requesting anything special. Just 1 core and 1G of memory.
>
> Any thoughts on why the scheduler would just stop scheduling jobs? This cluster is running on AWS, and my intention is to provide enough nodes that jobs never queue; that had been working until now.
>
> I’ve tried restarting slurmctld with an increased logging level, but no progress.
>
> I see the following messages in slurmctld.log:
>
> [2020-03-29T21:55:58.951] debug: sched: Running job scheduler
> [2020-03-29T21:55:58.953] debug: sched: JobId=409999. State=PENDING. Reason=Priority, Priority=100013. Partition=interactive.
> [2020-03-29T21:56:58.932] debug: sched: Running job scheduler
> [2020-03-29T21:56:58.934] debug: sched: JobId=409999. State=PENDING. Reason=Priority, Priority=100013. Partition=interactive.
>
> The output of scontrol for this job is:
>
> JobId=409999 JobName=sbatch_wrap.sh
> UserId=akshabal(67674) GroupId=domain_users(66049) MCS_label=N/A
> Priority=100013 Nice=0 Account=(null) QOS=normal
> JobState=PENDING Reason=Priority Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
> SubmitTime=2020-03-29T19:34:09 EligibleTime=2020-03-29T19:34:09
> AccrueTime=2020-03-29T19:34:09
> StartTime=2020-03-29T21:51:27 EndTime=Unknown Deadline=N/A
> SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-03-29T21:50:58
> Partition=interactive AllocNode:Sid=a-2vaol6a8g9ca8.mla.annapurna.aws.a2z.com:7549
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=(null)
> NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
> TRES=cpu=8,mem=32G,node=1,billing=8
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> MinCPUsNode=8 MinMemoryNode=32G MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/tools/slurm/bin/sbatch_wrap.sh jg source.tcl
> WorkDir=/proj/trench_work4/akshabal/wa_fixes_array_sequencer/verif/fv/sunda_tpb/tpb_state_buf
> StdErr=/proj/trench_work4/akshabal/wa_fixes_array_sequencer/verif/fv/sunda_tpb/tpb_state_buf/slurm-409999.out
> StdIn=/dev/null
> StdOut=/proj/trench_work4/akshabal/wa_fixes_array_sequencer/verif/fv/sunda_tpb/tpb_state_buf/slurm-409999.out
> Power=
>
> How do I go about debugging this?