[slurm-users] Large job starvation on cloud cluster
Michael Gutteridge
michael.gutteridge at gmail.com
Thu Feb 28 15:29:15 UTC 2019
`sprio --long` shows:

          JOBID PARTITION     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS       NICE   TRES
...
        2203317 largenode    alice    1000010         10          0          0          0    1000000          0
        2203318 largenode    alice    1000010         10          0          0          0    1000000          0
        2203319 largenode    alice    1000010         10          0          0          0    1000000          0
        2203320 largenode    alice    1000010         10          0          0          0    1000000          0
        2203321 largenode    alice    1000010         10          0          0          0    1000000          0
        2221670 largenode       me    3000000          0          0          0          0    1000000   -2000000
root@beagle-ctld:~#
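If I'm reading that right, the large job is already well ahead of alice's jobs on priority: with the QOS factor of 1000000 and Nice=-2000000, that's 1000000 - (-2000000) = 3000000, versus 1000010 for the small jobs. A quick way to double-check the factor breakdown for just that job (a sketch, using the same job id as above):

    sprio -l -j 2221670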
`squeue --start` and `scontrol show job` show:
  2221670 largenode sleeper.       me PD        N/A      1   (null)   (AssocGrpCpuLimit)
JobId=2221670 JobName=sleeper.sh
UserId=me(12345) GroupId=g_me(12345) MCS_label=N/A
Priority=3000000 Nice=-2000000 Account=account QOS=normal
JobState=PENDING Reason=AssocGrpCpuLimit Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=3-00:00:00 TimeMin=N/A
SubmitTime=2019-02-28T06:11:17 EligibleTime=2019-02-28T06:11:17
AccrueTime=2019-02-28T06:11:17
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-02-28T06:32:27
Partition=largenode AllocNode:Sid=fitzroy:13714
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
TRES=cpu=16,node=1,billing=16
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=16 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=./sleeper.sh
WorkDir=/home/me/tutorial/run
StdErr=/home/me/tutorial/run/slurm-2221670.out
StdIn=/dev/null
StdOut=/home/me/tutorial/run/slurm-2221670.out
Power=
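Reason=AssocGrpCpuLimit makes me think it's the association's GrpTRES cpu cap, rather than priority, that is holding the 16-core job back. This is roughly how I'd check the configured limit (a sketch - assuming sacctmgr access, and that the account really is "account" as shown in the job record above):

    sacctmgr show assoc where user=me format=cluster,account,user,grptres
    # slurmctld's in-memory view of the same association limits
    scontrol show assoc_mgr flags=assoc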
I think you are correct about that, though I'm not sure how to debug those
reservations.
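The closest I've found so far is turning on the backfill debug flag and watching the controller log (a sketch - I'm assuming slurmctld logs to /var/log/slurmctld.log here):

    scontrol setdebugflags +backfill
    # look for backfill messages about the large job
    grep -i backfill /var/log/slurmctld.log | grep 2221670
    # and check how far ahead backfill will plan (bf_window in SchedulerParameters)
    scontrol show config | grep SchedulerParameters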
Thanks
M
On Wed, Feb 27, 2019 at 10:22 PM Chris Samuel <chris at csamuel.org> wrote:
> On Wednesday, 27 February 2019 1:08:56 PM PST Michael Gutteridge wrote:
>
> > Yes, we do have time limits set on partitions- 7 days maximum, 3 days
> > default. In this case, the larger job is requesting 3 days of walltime,
> > the smaller jobs are requesting 7.
>
> It sounds like no forward reservation is being created for the larger job,
> what do these say?
>
> sprio -l
>
> squeue --start
>
> scontrol show job ${LARGE_JOBID}
>
> All the best,
> Chris
> --
> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA