<!DOCTYPE html><html><head><title></title><style type="text/css">p.MsoNormal,p.MsoNoSpacing{margin:0}</style></head><body><div style="font-size:16px;"><br></div><div style="font-size:16px;">Hello.<br></div><div style="font-size:16px;"><br></div><div style="font-size:16px;">I'm running Slurm 18.08.1 and have configured limits for our users using QOS. The default QOS has the limits set, and most users belong to it.<br></div><div style="font-size:16px;"><br></div><pre># sacctmgr show qos
      Name   Priority  GraceTime    Preempt PreemptMode                                    Flags UsageThres UsageFactor       GrpTRES   GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit     GrpWall       MaxTRES MaxTRESPerNode   MaxTRESMins     MaxWall     MaxTRESPU MaxJobsPU MaxSubmitPU     MaxTRESPA MaxJobsPA MaxSubmitPA       MinTRES
---------- ---------- ---------- ---------- ----------- ---------------------------------------- ---------- ----------- ------------- ------------- ------------- ------- --------- ----------- ------------- -------------- ------------- ----------- ------------- --------- ----------- ------------- --------- ----------- -------------
    normal          0   00:00:00                cluster                                                        1.000000                                                                                                                                cpu=72,mem=7+                 10000
       nav          0   00:00:00                cluster                                                        1.000000
       eva          0   00:00:00                cluster                                                        1.000000                                                                                                                                                                    cpu=18,mem=1+
 emre-high          0   00:00:00                cluster                                                        1.000000</pre><div style="font-size:16px;"><br></div><div style="font-size:16px;">Nothing has changed recently, but today I noticed that the QOS limits, which had been working until now, have silently stopped working. A single user was able to submit enough jobs to saturate the cluster, annoying the other users.<br></div><div style="font-size:16px;"><br></div><div style="font-size:16px;">There are no errors in the slurmctld logs.<br></div><div style="font-size:16px;"><br></div><div style="font-size:16px;">How can I go about troubleshooting this? Any suggestions are welcome.<br></div><div style="font-size:16px;"><br></div><div style="font-size:16px;">/ Aravindh</div><div style="font-size:16px;"><br></div><div style="font-size:16px;"><br></div><div style="font-size:16px;"><br></div><div id="sig56753105"><div class="signature">-- <br></div><div class="signature">  Aravindh Sampathkumar<br></div><div class="signature">  aravindh@fastmail.com<br></div><div class="signature"><br></div></div><div style="font-size:16px;"><br></div></body></html>