<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body style="overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">So Ole, any thoughts on the config info I sent? <div><br></div><div>I’m still not certain whether terminating a running job based on GrpTRESMins is even possible or supposed to work.</div><div><br></div><div>Hoot</div><div><br><div><br><blockquote type="cite"><div>On Apr 24, 2023, at 3:21 PM, Hoot Thompson &lt;hoot_thompson@verizon.net&gt; wrote:</div><br class="Apple-interchange-newline"><div><meta charset="UTF-8"><span style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; float: none; display: inline !important;">See below…</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;"><div style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;"><br><blockquote type="cite"><div>On Apr 24, 2023, at 1:55 PM, Ole Holm Nielsen &lt;Ole.H.Nielsen@fysik.dtu.dk&gt; wrote:</div><br class="Apple-interchange-newline"><div><div>On 24-04-2023 18:33, Hoot Thompson wrote:<br><blockquote type="cite">In my reading of the Slurm documentation, it seems that exceeding the limits set in GrpTRESMins should result in terminating a running 
job. However, in testing this, the ‘current value’ of the GrpTRESMins only updates upon job completion and is not updated as the job progresses. Therefore jobs aren’t being stopped. On the positive side, no new jobs are started if the limit is exceeded. Here’s the documentation that is confusing me…<br></blockquote><br>I think the job's resource usage will only be added to the Slurm database upon job completion. I believe that Slurm doesn't update the resource usage continually as you seem to expect.<br><br><blockquote type="cite">If any limit is reached, all running jobs with that TRES in this group will be killed, and no new jobs will be allowed to run.<br>Perhaps there is a setting or misconfiguration on my part.<br></blockquote><br>The sacctmgr manual page states:<br><br><blockquote type="cite">GrpTRESMins=TRES=&lt;minutes&gt;[,TRES=&lt;minutes&gt;,...]<br>The total number of TRES minutes that can possibly be used by past, present and future jobs running from this association and its children. To clear a previously set value use the modify command with a new value of -1 for each TRES id.<br>NOTE: This limit is not enforced if set on the root association of a cluster. So even though it may appear in sacctmgr output, it will not be enforced.<br>ALSO NOTE: This limit only applies when using the Priority Multifactor plugin. The time is decayed using the value of PriorityDecayHalfLife or PriorityUsageResetPeriod as set in the slurm.conf. 
When this limit is reached all associated jobs running will be killed and all future jobs submitted with associations in the group will be delayed until they are able to run inside the limit.<br></blockquote><br>Can you please confirm that you have configured the "Priority Multifactor" plugin?<br></div></div></blockquote><div>Here’s relevant items from slurm.conf</div><div><br></div><div><br></div><div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1"># Activate the Multifactor Job Priority Plugin with decay</span><o:p></o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1">PriorityType=priority/multifactor</span><o:p></o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><o:p> </o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1"># apply no decay</span><o:p></o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1">PriorityDecayHalfLife=0</span><o:p></o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><o:p> </o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1"># reset usage after 1 month</span><o:p></o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1">PriorityUsageResetPeriod=MONTHLY</span><o:p></o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><o:p> </o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1"># The larger the job, the greater its job size priority.</span><o:p></o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1">PriorityFavorSmall=NO</span><o:p></o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><o:p> </o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1"># The job's age factor reaches 1.0 after waiting in 
the</span><o:p></o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1"># queue for 2 weeks.</span><o:p></o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1">PriorityMaxAge=14-0</span><o:p></o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><o:p> </o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1"># This next group determines the weighting of each of the</span><o:p></o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1"># components of the Multifactor Job Priority Plugin.</span><o:p></o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1"># The default value for each of the following is 1.</span><o:p></o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1">PriorityWeightAge=1000</span><o:p></o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1">PriorityWeightFairshare=10000</span><o:p></o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1">PriorityWeightJobSize=1000</span><o:p></o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1">PriorityWeightPartition=1000</span><o:p></o:p></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1">PriorityWeightQOS=0 # don't use the qos factor</span></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1"><br></span></div><div style="margin: 0in; font-size: 8.5pt; font-family: Menlo;"><span class="s1"><br></span></div></div><blockquote type="cite"><div><div><br>Your jobs should not be able to start if the user's GrpTRESMins has been exceeded. 
Hence they won't be killed!<br></div></div></blockquote><div><br></div>Yes, this works fine.<br><blockquote type="cite"><div><div><br>Can you explain step by step what you observe? It may be that the above documentation of killing jobs is in error, in which case we should make a bug report to SchedMD.<br></div></div></blockquote><div><br></div>I set the GrpTRESMins limit to a very small number and then ran a sleep job that exceeded the limit. The job continued to run past the limit until I killed it. It was the only job in the queue. And if it makes any difference, this testing is being done in AWS on a parallel cluster.<br><blockquote type="cite"><div><div><br>/Ole</div></div></blockquote></div></div></blockquote></div><br></div></body></html>