<div dir="ltr"><div class="gmail_default" style="font-family:verdana,sans-serif">Hi Slurm-Users,</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">Hope this post finds all of you healthy and safe amidst the ongoing COVID-19 craziness. We've got a strange error state that occurs when we enable preemption, and we need help diagnosing what is wrong. I'm not sure if we are missing a default value or some other necessary configuration, but after we enable Slurm preemption on a cluster with multiple queues, Slurm stops reporting all 40+ CPUs on each node and, after some random amount of time, reports only a single CPU per node. This is problematic on multiple levels and has caused problems for users submitting jobs that request more than one CPU.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">For some quick background on our setup: we have a 100+ node Linux cluster that uses Lustre for storage, is managed with Bright View, and uses Slurm as its scheduler. The slurm.conf file lives on a shared volume, on one of the Lustre file systems, that is mounted across all the nodes. We have defined a number of queues for Slurm to use and have three distinct tiers of workloads. Before setting out we looked around but were unable to find a succinct how-to on the web describing how to configure the kind of 3-tier design we wanted, so I'll outline the steps we took below. 
We've tried a number of variations of the examples from <a href="https://slurm.schedmd.com/preempt.html" style="font-family:Arial,Helvetica,sans-serif">https://slurm.schedmd.com/preempt.html</a>, but none exactly matches the model we want, so we may still be missing key configuration options.<br></div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">The desired high-level design is for all compute and GPU nodes to exist in a lowest-priority "windfall" queue (PriorityTier value of 100), with a pair of medium-priority default queues above it (PriorityTier value of 200), called "defq" and "gpuq" for ease of use, and finally 20 or so high-priority queues for particular research groups above that (PriorityTier value of 300), which are limited to just a few nodes per queue and should take final precedence. </div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">As for how to handle preemption on each tier, we don't plan to SUSPEND jobs, but rather to CANCEL a windfall job or REQUEUE a defq/gpuq job when a higher-priority job from the researcher-specific queues requests a resource that is already in use by a lower-priority job. 
The final layout looks something like so:</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">PriorityTier   PreemptMode   QueueType  NodeType</div><div class="gmail_default" style="font-family:verdana,sans-serif">100             CANCEL            windfall       all</div><div class="gmail_default" style="font-family:verdana,sans-serif">200             REQUEUE          defq           cpu</div><div class="gmail_default" style="font-family:verdana,sans-serif">200             REQUEUE          gpuq          gpu</div><div class="gmail_default" style="font-family:verdana,sans-serif">300             REQUEUE          lab1           cpu</div><div class="gmail_default" style="font-family:verdana,sans-serif">300             REQUEUE          lab2           cpu</div><div class="gmail_default" style="font-family:verdana,sans-serif">300             ...                     ...              ...</div><div class="gmail_default" style="font-family:verdana,sans-serif">(etc.)</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">Once this was laid out the next step was to ensure each queue that we created had a predefined value of "CANCEL" or "REQUEUE" rather than "OFF" before enabling the 'preempt/partition_prio' plugin or we'd get an error. 
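<br><br>As a rough sketch, the partition definitions for this layout would look something like the lines below. The node names and ranges are placeholders rather than our real ones; only the PriorityTier and PreemptMode values come from the table above.<br><br>> # Hypothetical node lists, for illustration only<br>> PartitionName=windfall Nodes=cpu[001-100],gpu[01-10] PriorityTier=100 PreemptMode=CANCEL<br>> PartitionName=defq Nodes=cpu[001-100] PriorityTier=200 PreemptMode=REQUEUE<br>> PartitionName=gpuq Nodes=gpu[01-10] PriorityTier=200 PreemptMode=REQUEUE<br>> PartitionName=lab1 Nodes=cpu[001-004] PriorityTier=300 PreemptMode=REQUEUE<br>> PartitionName=lab2 Nodes=cpu[005-008] PriorityTier=300 PreemptMode=REQUEUE<br><br>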
Since the initial cluster design didn't use preemption, we added the PriorityType line first: </div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">> PriorityType=priority/multifactor</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">Then we added the following 2 lines to slurm.conf, which seemed to enable preemption.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">> PreemptType=preempt/partition_prio<br>> PreemptMode=REQUEUE<br><br>As far as I understand, those 2 lines should enable the plugin (and set the global default preemption mode for good measure). </div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">To test the changes we created a smaller queue with only 3 nodes so that we could start some interactive jobs and watch them be canceled or requeued as we requested higher-priority workloads. Our issue occurs when we enable the preempt type. At first everything seems to be working fine; however, after some random amount of time all the nodes stop reporting 40+ CPUs and report only a single CPU. This is visible to the admin via `sinfo --Node --long` and to the users by the fact that only single-CPU jobs can be requested.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">It makes no sense. It's as if, all of a sudden, the computers only have one CPU. All the more frustrating, the nodes don't stop misbehaving right away when we change back to the previous configuration. 
</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br>Big question: Is this an issue anyone has seen before? Any clue what we are doing wrong, or how to further diagnose the problem when it occurs? </div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">At the moment my thoughts for next steps are to turn up Slurm debugging and purposefully let the error happen again, but testing on a production cluster always scares me a little. Any thoughts about which logs to check and what kind of events to watch for would be greatly appreciated. We are open to any thoughts or suggestions!</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif"><div class="gmail_default">I'm also a bit unclear about how the priority calculation is made. I looked at the priority values generated and they didn't seem to map to the changes in the queues' PriorityTier values. I tried limiting the priority calculation to use ONLY the partition priority with the additional config options below, but still didn't get the nice clean calculation I had hoped for. </div><div class="gmail_default"><br></div><div class="gmail_default">> PriorityWeightFairshare=0<br>> PriorityWeightAge=0<br>> PriorityWeightTRES=0<br>> PriorityWeightPartition=100000<br>> PriorityWeightJobSize=0<br>> PriorityWeightQOS=0</div></div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">Thanks in advance,</div><div class="gmail_default" style="font-family:verdana,sans-serif">Josh</div></div>