[slurm-users] Preemption vs. backfill
Jeffrey Frey
frey at udel.edu
Wed Mar 20 13:58:21 UTC 2019
Config details:
- Slurm v17.11.8
- QOS-based preemption
- Backfill scheduler (default parameters)
- QOS:
  - "normal" = PreemptMode=CANCEL, GraceTime=5 minutes
  - Per-stakeholder = Preempt=normal, GrpTRES=<limits>
- Partitions:
  - "standard" (default) = QOS=normal
  - Per-stakeholder = QOS=<stakeholder-qos>
When users need priority access to purchased hardware, they submit to a stakeholder partition; jobs in stakeholder partitions can preempt jobs in the opportunistic "standard" partition.
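For reference, the moving parts look roughly like the sketch below. The name stakeholder_x, the node lists, and the TRES limits are placeholders rather than our actual values, and the GraceTime on the QOS is expressed in seconds:

    # slurm.conf (excerpt)
    PreemptType=preempt/qos
    PreemptMode=CANCEL
    PartitionName=standard Default=YES QOS=normal Nodes=<node-list>
    PartitionName=stakeholder_x QOS=stakeholder_x Nodes=<node-list>

    # QOS side, via sacctmgr
    sacctmgr modify qos normal set PreemptMode=Cancel GraceTime=300
    sacctmgr add qos stakeholder_x
    sacctmgr modify qos stakeholder_x set Preempt=normal GrpTRES=cpu=<limit>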
Problem 1: Preemption not logged
=================================
A number of users have reported jobs that fail before reaching their TimeLimit. There is no mention of preemption in the slurmd/slurmctld logs, but the job's EndTime has been altered from its original value and the job step(s) end up with FAILED status:
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
237755 ewb0.55_b+ standard XXXXXXXXXX 72 FAILED 77:0
237755.batch batch XXXXXXXXXX 36 FAILED 77:0
237755.exte+ extern XXXXXXXXXX 72 COMPLETED 0:0
237755.0 orted XXXXXXXXXX 1 FAILED 1:0
The slurm-237755.out file shows that the MPI runtime lost contact with the remote orted daemon (237755.0) and failed. I think I've traced this through the Slurm source far enough to determine that, with a non-zero GraceTime, the EndTime is altered and the SIGCONT-SIGTERM pair is sent directly by slurmctld (no SIG_PREEMPTED; see slurm_job_check_grace() and _preempt_signal() in slurmctld/preempt.c); slurmctld logs nothing about the preemption at this point except at debug levels. The signal is not caught by orted, so the 237755.0 step dies and slurmd logs:
[237755.0] error: *** STEP 237755.0 ON r00n23 CANCELLED AT 2019-01-15T11:20:42 ***
because the signal in question was SIGTERM (see slurmd/slurmstepd/req.c). This causes the MPI runtime to exit, and the batch script returns with an error. Since GraceTime hasn't expired, slurmctld logs a failure instead of a preemption. The only indicators of preemption are the EndTime altered from its starting value and a job from a per-stakeholder partition (which can preempt "standard") starting on the nodes _immediately_ after the death of 237755:
[2019-01-15T05:16:12.894] _slurm_rpc_submit_batch_job: JobId=237755 InitPrio=2275 usec=5960
[2019-01-15T09:11:19.367] sched: Allocate JobID=237755 NodeList=r00n[21,23] #CPUs=72 Partition=standard
[2019-01-15T11:20:41.430] _slurm_rpc_submit_batch_job: JobId=238396 InitPrio=22402 usec=6505
[2019-01-15T11:20:43.594] _job_complete: JobID=237755 State=0x1 NodeCnt=2 WEXITSTATUS 205
[2019-01-15T11:20:43.594] email msg to XXXXXXXXXXXXXX: SLURM Job_id=237755 Name=ewb0.55_b96sc00_a Failed, Run time 02:09:24, FAILED, ExitCode 205
[2019-01-15T11:20:43.595] _job_complete: JobID=237755 State=0x8005 NodeCnt=2 done
[2019-01-15T11:20:44.606] email msg to XXXXXXXXXXXXXX: SLURM Job_id=238396 Name=Pt111_sol Began, Queued time 00:00:03
[2019-01-15T11:20:44.606] sched: Allocate JobID=238396 NodeList=r00n[21,23] #CPUs=72 Partition=ccei_biomass
[2019-01-15T14:13:55.232] _job_complete: JobID=238396 State=0x1 NodeCnt=2 WEXITSTATUS 0
[2019-01-15T14:13:55.237] _job_complete: JobID=238396 State=0x8003 NodeCnt=2 done
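The closest we've come to flagging these after the fact is comparing the accounting End against Start plus the submitted Timelimit; a FAILED job that ends well short of its limit, with a stakeholder job landing on the same nodes seconds later, is the fingerprint. Something along these lines (the field list is just what we happen to pull):

    sacct -j 237755 --format=JobID,Partition,State,ExitCode,Timelimit,Start,End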
In contrast, jobs that catch or ignore SIGCONT-SIGTERM and keep running through the 5-minute GraceTime are then killed, and their overall state is logged as PREEMPTED:
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
412845 TTC100um1 standard XXXXXXXXXX 36 PREEMPTED 0:0
412845.batch batch XXXXXXXXXX 36 CANCELLED 0:15
412845.exte+ extern XXXXXXXXXX 36 COMPLETED 0:0
So any job that catches SIGCONT-SIGTERM, gracefully ends its work, and exits before GraceTime expires would be logged as COMPLETED? Is this behavior in line with anyone else's experience?
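For what it's worth, we've started telling users that a batch script can ride out the grace period by trapping the signal itself; a minimal sketch, where my_application stands in for the real workload:

    #!/bin/bash
    #SBATCH --partition=standard
    # ...usual directives...

    cleanup() {
        echo "caught SIGTERM (presumably preemption); shutting down" >&2
        kill -TERM "$child" 2>/dev/null   # forward the signal to the application
        wait "$child"                     # let it finish its own cleanup
        exit 0                            # exit before GraceTime expires
    }
    trap cleanup TERM

    ./my_application &                    # placeholder for the real workload
    child=$!
    wait "$child"

If that script exits 0 before the 5 minutes are up, the job appears to go into the books as COMPLETED, which is exactly the behavior I'm asking about.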
Problem 2: Backfill of preempted node(s)
=========================================
This one was especially confounding:
[2019-03-07T23:08:43.769] _slurm_rpc_submit_batch_job: JobId=410289 InitPrio=2997 usec=5694
[2019-03-07T23:08:46.215] udhpc: setting job account to safrono (1040)
[2019-03-07T23:08:46.215] udhpc: time_min is empty, setting to time_limit
:
[2019-03-08T00:28:04.364] backfill: Started JobID=410289 in standard on r00n29
[2019-03-08T00:32:43.490] job_time_limit: Preemption GraceTime reached JobId=409950
[2019-03-08T00:32:43.501] job_time_limit: Preemption GraceTime reached JobId=409951
[2019-03-08T00:32:43.503] job_time_limit: Preemption GraceTime reached JobId=409952
[2019-03-08T00:32:43.505] job_time_limit: Preemption GraceTime reached JobId=409953
[2019-03-08T00:32:43.507] job_time_limit: Preemption GraceTime reached JobId=409954
[2019-03-08T00:32:43.509] job_time_limit: Preemption GraceTime reached JobId=409955
[2019-03-08T00:32:43.511] job_time_limit: Preemption GraceTime reached JobId=409956
[2019-03-08T00:32:43.513] job_time_limit: Preemption GraceTime reached JobId=409957
[2019-03-08T00:32:43.515] job_time_limit: Preemption GraceTime reached JobId=409958
[2019-03-08T00:32:43.517] job_time_limit: Preemption GraceTime reached JobId=409959
[2019-03-08T00:32:43.519] job_time_limit: Preemption GraceTime reached JobId=409960
[2019-03-08T00:32:43.520] job_time_limit: Preemption GraceTime reached JobId=409961
[2019-03-08T00:32:43.522] job_time_limit: Preemption GraceTime reached JobId=409962
[2019-03-08T00:32:43.524] job_time_limit: Preemption GraceTime reached JobId=409963
[2019-03-08T00:32:43.526] job_time_limit: Preemption GraceTime reached JobId=409964
[2019-03-08T00:32:43.527] job_time_limit: Preemption GraceTime reached JobId=409965
[2019-03-08T00:32:43.529] job_time_limit: Preemption GraceTime reached JobId=409966
[2019-03-08T00:32:43.531] job_time_limit: Preemption GraceTime reached JobId=409967
[2019-03-08T00:32:43.532] job_time_limit: Preemption GraceTime reached JobId=409968
[2019-03-08T00:32:43.534] job_time_limit: Preemption GraceTime reached JobId=409969
[2019-03-08T00:32:43.535] job_time_limit: Preemption GraceTime reached JobId=409970
[2019-03-08T00:32:43.537] job_time_limit: Preemption GraceTime reached JobId=409971
[2019-03-08T00:32:43.538] job_time_limit: Preemption GraceTime reached JobId=410003
[2019-03-08T00:32:46.226] _job_complete: JobID=410289 State=0x1 NodeCnt=1 WEXITSTATUS 1
[2019-03-08T00:32:46.230] _job_complete: JobID=410289 State=0x8005 NodeCnt=1 done
[2019-03-08T00:32:46.310] email msg to linalee at udel.edu: SLURM Job_id=410300 Name=mesh_12x12x4_T3 Began, Queued time 00:05:17
[2019-03-08T00:32:46.350] sched: Allocate JobID=410300 NodeList=r00n29 #CPUs=36 Partition=ccei_biomass
Job 410300 was in a per-stakeholder partition and thus preempted jobs in the "standard" partition. With a GraceTime of 5 minutes, the preemption must have occurred ca. 00:27:43. So how did the backfill scheduler manage to start a new job (410289) on that node at 00:28:04, _after_ the preemption state should already have been noted on r00n29? The submitted TimeLimit on 410289 was 7 days, yet again its EndTime was altered to the exact time it died, and it died the same way as in Problem 1: the initial SIGCONT-SIGTERM delivery killed step 0, etc. With its EndTime reset to 00:32:46, the preemption would have had to happen ca. 00:27:46. This looks like a race condition.
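When we dig into one of these, we end up reconstructing the timeline from accounting plus the controller log, roughly as follows (the log path is whatever SlurmctldLogFile points to on your system):

    # victim vs. preemptor timestamps from accounting
    sacct -j 410289,410300 --format=JobID,Partition,Submit,Start,End,Timelimit,State

    # controller's view of the node over the same window
    grep -E '410289|410300|r00n29' /var/log/slurmctld.log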
The release notes don't seem to indicate that these issues were known and have been addressed, and I didn't find anything on bugs.schedmd.com that seemed to pertain. To anyone who has encountered similar behavior: how did you mitigate it?
::::::::::::::::::::::::::::::::::::::::::::::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE 19716
Office: (302) 831-6034 Mobile: (302) 419-4976
::::::::::::::::::::::::::::::::::::::::::::::::::::::