<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">

</head>

<body text="#000000" bgcolor="#FFFFFF">

<p>Hi David,</p>

<p><br>

</p>

<p>You might have a look at the thread "Large job starvation on cloud cluster" that started on Feb 27; there are some good tidbits in there. Off the top of my head, without more information, I would venture that the settings you have in slurm.conf end up backfilling the smaller jobs at the expense of scheduling the larger jobs.</p>

<p><br>

</p>

<p>Your partition configs plus accounting and scheduler configs from slurm.conf would be helpful.</p>

<p><br>

</p>

<p>Also, search for "job starvation" here: <a href="https://slurm.schedmd.com/sched_config.html">

https://slurm.schedmd.com/sched_config.html</a> as another potential starting point.</p>
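
<p>As a rough sanity check on the numbers quoted below, here is a minimal Python sketch that divides each sprio column by the matching PriorityWeight* setting. It assumes sprio is reporting weighted components (the default, as far as I know), so the quotient is the normalized factor; the dictionary names are only for illustration.</p>

<pre>
```python
# Weighted components from the sprio line quoted below (job 359323).
components = {
    "age": 100000,
    "fairshare": 79646,
    "jobsize": 547,
    "partition": 100,
    "qos": 0,
}

# PriorityWeight* values from the quoted scheduler settings.
weights = {
    "age": 100000,
    "fairshare": 1000000,
    "jobsize": 10000000,
    "partition": 1000,
    "qos": 10000,
}

# Assumption: each sprio column is (normalized factor * weight), so dividing
# by the weight recovers a factor between 0.0 and 1.0.
factors = {name: components[name] / weights[name] for name in components}

# Sum of the weighted components, for comparison with the PRIORITY column.
total = sum(components.values())

for name in components:
    print(f"{name:10s} weighted={components[name]:7d} factor={factors[name]:.6f}")
print(f"sum of components = {total}")
```
</pre>

<p>The age factor comes out at 1.0, which suggests the job has already reached PriorityMaxAge and gains nothing further from waiting. The components sum to 180293 against the 180292 that sprio reports; I would guess the difference is rounding.</p>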

<p><br>

</p>

<p>Best,</p>

<p>Cyrus</p>

<p><br>

</p>

<div class="moz-cite-prefix">On 3/21/19 8:55 AM, David Baker wrote:<br>

</div>

<blockquote type="cite" cite="mid:AM6PR04MB46466CCDED1C643BD75F08CFFE420@AM6PR04MB4646.eurprd04.prod.outlook.com">

<style type="text/css" style="display:none;"><!-- P {margin-top:0;margin-bottom:0;} --></style>

<div id="divtagdefaultwrapper" dir="ltr" style="font-size:12pt;

        color:rgb(0,0,0);

        font-family:Calibri,Helvetica,sans-serif,EmojiFont,"Apple

        Color Emoji","Segoe UI

        Emoji",NotoColorEmoji,"Segoe UI

        Symbol","Android Emoji",EmojiSymbols">

<p style="margin-top:0; margin-bottom:0">Hello,</p>

<p style="margin-top:0; margin-bottom:0"><br>

</p>

<p style="margin-top:0; margin-bottom:0">I understand that this is not a straightforward question; however, I'm wondering if anyone has any useful ideas, please. Our cluster is busy and the QOS has limited users to a maximum of 32 compute nodes on the "batch" queue. Users are making good use of the cluster -- for example, one user is running five 6-node jobs at the moment. On the other hand, a job belonging to another user has been stalled in the queue for around 7 days. He has made reasonable use of the cluster and as a result his fairshare component is relatively low. Having said that, the priority of his job is high -- it is currently one of the highest priority jobs in the batch partition queue. From sprio...</p>

<p style="margin-top:0; margin-bottom:0"><br>

</p>

<pre style="margin-top:0; margin-bottom:0">JOBID   PARTITION  PRIORITY     AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS
359323  batch        180292  100000      79646      547        100    0</pre>

<p style="margin-top:0; margin-bottom:0"><span><br>

</span></p>

<p style="margin-top:0; margin-bottom:0"><span>I did think that the PriorityDecayHalfLife was quite high at 14 days and so I reduced that to 7 days. For reference I've included the key scheduling settings from the cluster below. Does anyone have any thoughts,

 please? </span></p>

<p style="margin-top:0; margin-bottom:0"><span><br>

</span></p>

<p style="margin-top:0; margin-bottom:0"><span>Best regards,</span></p>

<p style="margin-top:0; margin-bottom:0"><span>David</span></p>

<p style="margin-top:0; margin-bottom:0"><span><br>

</span></p>

<p style="margin-top:0; margin-bottom:0"><span></span></p>

<div>PriorityDecayHalfLife   = 7-00:00:00</div>

<div>PriorityCalcPeriod      = 00:05:00</div>

<div>PriorityFavorSmall      = No</div>

<div>PriorityFlags           = ACCRUE_ALWAYS,SMALL_RELATIVE_TO_TIME,FAIR_TREE</div>

<div>PriorityMaxAge          = 7-00:00:00</div>

<div>PriorityUsageResetPeriod = NONE</div>

<div>PriorityType            = priority/multifactor</div>

<div>PriorityWeightAge       = 100000</div>

<div>PriorityWeightFairShare = 1000000</div>

<div>PriorityWeightJobSize   = 10000000</div>

<div>PriorityWeightPartition = 1000</div>

<div>PriorityWeightQOS       = 10000</div>


</div>

</blockquote>

</body>

</html>