Config details:

    - Slurm v17.11.8
    - QOS-based preemption
    - Backfill scheduler (default parameters)
    - QOS:
        - "normal" = PreemptMode=CANCEL, GraceTime=5 minutes
        - per-stakeholder = Preempt=normal, GrpTRES=<limits>
    - Partitions:
        - "standard" (default) = QOS=normal
        - per-stakeholder = QOS=<stakeholder-qos>

When users need priority access to purchased hardware, they submit to a stakeholder partition; jobs in stakeholder partitions can preempt jobs in the opportunistic "standard" partition.
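
In slurm.conf / sacctmgr terms the setup corresponds roughly to the following.  This is a condensed sketch only: the "stakeholder" QOS/partition names, the GrpTRES value, and the node lists are placeholders, not our actual limits.

    # QOS side (sacctmgr takes GraceTime in seconds, so 5 minutes = 300):
    sacctmgr modify qos normal set PreemptMode=Cancel GraceTime=300
    sacctmgr add qos stakeholder
    sacctmgr modify qos stakeholder set Preempt=normal GrpTRES=cpu=72

    # slurm.conf side (node lists and unrelated options elided):
    #   PreemptType=preempt/qos
    #   PartitionName=standard     Default=YES  QOS=normal       Nodes=...
    #   PartitionName=stakeholder               QOS=stakeholder  Nodes=...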


Problem 1:  Preemption not logged
=================================

We've had a number of users reporting jobs that fail before reaching their TimeLimit.  There is no mention of preemption in the slurmd/slurmctld logs, but the EndTime has been altered from its original value and the job step(s) end up with FAILED status:

           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
    ------------ ---------- ---------- ---------- ---------- ---------- -------- 
    237755       ewb0.55_b+   standard XXXXXXXXXX         72     FAILED     77:0 
    237755.batch      batch            XXXXXXXXXX         36     FAILED     77:0 
    237755.exte+     extern            XXXXXXXXXX         72  COMPLETED      0:0 
    237755.0          orted            XXXXXXXXXX          1     FAILED      1:0 

The slurm-237755.out file shows that the MPI runtime lost contact with the remote orted daemon (step 237755.0) and failed.  I think I've traced this far enough through the Slurm source to determine that, with a non-zero GraceTime, the EndTime is altered and the SIGCONT-SIGTERM pair is sent directly by slurmctld (no SIG_PREEMPTED; see slurm_job_check_grace() and _preempt_signal() in slurmctld/preempt.c); slurmctld logs nothing about the preemption at this point outside debug levels.  The signal is not caught by orted, so the 237755.0 step dies and slurmd logs:

    [237755.0] error: *** STEP 237755.0 ON r00n23 CANCELLED AT 2019-01-15T11:20:42 ***

because the signal in question was SIGTERM (see slurmd/slurmstepd/req.c).  This causes the MPI runtime to exit and the batch script to return in error.  Since GraceTime hasn't expired, slurmctld logs a failure instead of a preemption.  The only indicators of preemption are the altered EndTime (relative to its starting value) and a job from a per-stakeholder partition (which can preempt "standard") starting on the nodes _immediately_ after the death of 237755:

    [2019-01-15T05:16:12.894] _slurm_rpc_submit_batch_job: JobId=237755 InitPrio=2275 usec=5960
    [2019-01-15T09:11:19.367] sched: Allocate JobID=237755 NodeList=r00n[21,23] #CPUs=72 Partition=standard
    [2019-01-15T11:20:41.430] _slurm_rpc_submit_batch_job: JobId=238396 InitPrio=22402 usec=6505
    [2019-01-15T11:20:43.594] _job_complete: JobID=237755 State=0x1 NodeCnt=2 WEXITSTATUS 205
    [2019-01-15T11:20:43.594] email msg to XXXXXXXXXXXXXX: SLURM Job_id=237755 Name=ewb0.55_b96sc00_a Failed, Run time 02:09:24, FAILED, ExitCode 205
    [2019-01-15T11:20:43.595] _job_complete: JobID=237755 State=0x8005 NodeCnt=2 done
    [2019-01-15T11:20:44.606] email msg to XXXXXXXXXXXXXX: SLURM Job_id=238396 Name=Pt111_sol Began, Queued time 00:00:03
    [2019-01-15T11:20:44.606] sched: Allocate JobID=238396 NodeList=r00n[21,23] #CPUs=72 Partition=ccei_biomass
    [2019-01-15T14:13:55.232] _job_complete: JobID=238396 State=0x1 NodeCnt=2 WEXITSTATUS 0
    [2019-01-15T14:13:55.237] _job_complete: JobID=238396 State=0x8003 NodeCnt=2 done

In contrast, jobs that ignore or catch SIGCONT-SIGTERM and keep running through the 5-minute GraceTime are then killed, and their overall state is logged as preempted:

           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
    ------------ ---------- ---------- ---------- ---------- ---------- -------- 
    412845        TTC100um1   standard XXXXXXXXXX         36  PREEMPTED      0:0 
    412845.batch      batch            XXXXXXXXXX         36  CANCELLED     0:15 
    412845.exte+     extern            XXXXXXXXXX         36  COMPLETED      0:0 

So any job that catches SIGCONT-SIGTERM, gracefully ends its work, and exits before GraceTime expires would be logged as COMPLETED?  Is this behavior in line with anyone else's experience?
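
If that is indeed the intent, then presumably the user-side accommodation is a batch script along these lines.  This is an untested sketch: the program and checkpoint paths are placeholders, and it assumes the SIGTERM delivered at the start of GraceTime also reaches the batch step's shell (I haven't verified exactly which steps slurmctld signals).

    #!/bin/bash
    #SBATCH --partition=standard
    #SBATCH --time=7-00:00:00

    cleanup() {
        # Save whatever state we can and exit 0 before the 5-minute GraceTime
        # expires; per the behavior above, the job would then presumably be
        # recorded as COMPLETED rather than FAILED/PREEMPTED.
        cp -r "${TMPDIR}/checkpoint" "${SLURM_SUBMIT_DIR}/" || true
        exit 0
    }
    # SIGCONT arrives first and is harmless; only SIGTERM needs handling.
    trap cleanup TERM

    # Run the real work in the background so bash can run the trap as soon as
    # SIGTERM arrives (a foreground child delays trap delivery until it exits).
    srun ./my_solver &
    wait $!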



Problem 2:  Backfill of preempted node(s)
=========================================

This one was especially confounding:

    [2019-03-07T23:08:43.769] _slurm_rpc_submit_batch_job: JobId=410289 InitPrio=2997 usec=5694
    [2019-03-07T23:08:46.215] udhpc: setting job account to safrono (1040)
    [2019-03-07T23:08:46.215] udhpc: time_min is empty, setting to time_limit
       :
    [2019-03-08T00:28:04.364] backfill: Started JobID=410289 in standard on r00n29
    [2019-03-08T00:32:43.490] job_time_limit: Preemption GraceTime reached JobId=409950
    [2019-03-08T00:32:43.501] job_time_limit: Preemption GraceTime reached JobId=409951
    [2019-03-08T00:32:43.503] job_time_limit: Preemption GraceTime reached JobId=409952
    [2019-03-08T00:32:43.505] job_time_limit: Preemption GraceTime reached JobId=409953
    [2019-03-08T00:32:43.507] job_time_limit: Preemption GraceTime reached JobId=409954
    [2019-03-08T00:32:43.509] job_time_limit: Preemption GraceTime reached JobId=409955
    [2019-03-08T00:32:43.511] job_time_limit: Preemption GraceTime reached JobId=409956
    [2019-03-08T00:32:43.513] job_time_limit: Preemption GraceTime reached JobId=409957
    [2019-03-08T00:32:43.515] job_time_limit: Preemption GraceTime reached JobId=409958
    [2019-03-08T00:32:43.517] job_time_limit: Preemption GraceTime reached JobId=409959
    [2019-03-08T00:32:43.519] job_time_limit: Preemption GraceTime reached JobId=409960
    [2019-03-08T00:32:43.520] job_time_limit: Preemption GraceTime reached JobId=409961
    [2019-03-08T00:32:43.522] job_time_limit: Preemption GraceTime reached JobId=409962
    [2019-03-08T00:32:43.524] job_time_limit: Preemption GraceTime reached JobId=409963
    [2019-03-08T00:32:43.526] job_time_limit: Preemption GraceTime reached JobId=409964
    [2019-03-08T00:32:43.527] job_time_limit: Preemption GraceTime reached JobId=409965
    [2019-03-08T00:32:43.529] job_time_limit: Preemption GraceTime reached JobId=409966
    [2019-03-08T00:32:43.531] job_time_limit: Preemption GraceTime reached JobId=409967
    [2019-03-08T00:32:43.532] job_time_limit: Preemption GraceTime reached JobId=409968
    [2019-03-08T00:32:43.534] job_time_limit: Preemption GraceTime reached JobId=409969
    [2019-03-08T00:32:43.535] job_time_limit: Preemption GraceTime reached JobId=409970
    [2019-03-08T00:32:43.537] job_time_limit: Preemption GraceTime reached JobId=409971
    [2019-03-08T00:32:43.538] job_time_limit: Preemption GraceTime reached JobId=410003
    [2019-03-08T00:32:46.226] _job_complete: JobID=410289 State=0x1 NodeCnt=1 WEXITSTATUS 1
    [2019-03-08T00:32:46.230] _job_complete: JobID=410289 State=0x8005 NodeCnt=1 done
    [2019-03-08T00:32:46.310] email msg to linalee@udel.edu: SLURM Job_id=410300 Name=mesh_12x12x4_T3 Began, Queued time 00:05:17
    [2019-03-08T00:32:46.350] sched: Allocate JobID=410300 NodeList=r00n29 #CPUs=36 Partition=ccei_biomass

Job 410300 was in a per-stakeholder partition and thus preempted jobs in the "standard" partition.  With a GraceTime of 5 minutes, the preemption occurred ca. 00:27:43.  So how did the backfill scheduler manage to start a new job (410289) in "standard" on that node at 00:28:04, _after_ the preemption state should already have been noted on r00n29?  The submitted TimeLimit on 410289 was 7 days, yet its EndTime was again altered to the exact time it died, and it died the same way as in Problem 1: the initial SIGCONT-SIGTERM delivery killed step 0, etc.  With the EndTime reset to 00:32:46, the preemption would have had to happen ca. 00:27:46.  Seems like a race condition.
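
Incidentally, for anyone wanting to check their own accounting records for the same signature, the mismatch is easy to spot after the fact, e.g. (job ID from the excerpt above; the field list is just what we found useful):

    sacct -j 410289 --format=JobID,Partition,Timelimit,Submit,Start,End,State,ExitCode

A multi-day Timelimit with an End only minutes after Start, a FAILED state, no preemption message in the slurmctld log, and a stakeholder-partition job allocated the same node(s) at that same timestamp is the pattern in both problems above.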

The release notes don't seem to indicate that these issues were known or have since been addressed, and I didn't find anything on bugs.schedmd.com that seemed to pertain.  To anyone who has encountered similar behavior:  how did you mitigate it?



::::::::::::::::::::::::::::::::::::::::::::::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE  19716
Office: (302) 831-6034  Mobile: (302) 419-4976
::::::::::::::::::::::::::::::::::::::::::::::::::::::