[slurm-users] Preemption not working in 20.11
Prentice Bisbal
pbisbal at pppl.gov
Fri Feb 26 19:35:53 UTC 2021
We recently upgraded from Slurm 19.05.8 to 20.11.3. In our
configuration, we have an interruptible partition named 'interruptible'
for long-running, low-priority jobs that use checkpoint/restart. Jobs
that are preempted would be killed and requeued rather than suspended.
This configuration has been working without issue for 2+ years.
After the upgrade, this has stopped working. Preempted jobs are killed
and not requeued. My slurm.conf file is configured to requeue preempted
jobs:
$ grep -i requeue /etc/slurm/slurm.conf
#JobRequeue=1
PreemptMode=Requeue
And the user's sbatch script included the --requeue option.
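The grep above only matches lines containing "requeue", so it would not show other preemption settings such as PreemptType or a partition-level PreemptMode. As a minimal sketch of a broader check (the function name show_preempt_settings is a hypothetical helper, not part of Slurm), something like the following lists only the active, uncommented lines that mention either keyword:

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): print the active, uncommented
# lines of a slurm.conf-style file that mention preemption or requeueing.
show_preempt_settings() {
    conf_file="$1"
    # Drop comment lines first, then match either keyword case-insensitively.
    grep -Ev '^[[:space:]]*#' "$conf_file" | grep -Ei 'preempt|requeue'
}

# Example: show_preempt_settings /etc/slurm/slurm.conf
```

This only inspects the file on disk. On a running cluster, `scontrol show config | grep -i preempt` should show the values slurmctld has actually loaded, which may differ if the daemon was not reconfigured after an edit.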
The user reports that the error output from his preempted jobs now says:
slurmstepd: error: *** STEP 1075117.0 ON greene002 CANCELLED AT 2021-02-25T16:07:48 ***
In the past it would say PREEMPTED instead of CANCELLED.
Any ideas what would cause this? I've reported this to Slurm support,
and haven't gotten anything back yet, so I figured I'd ask here, too. If
this is a bug, I can't be the only one who has experienced this.
--
Prentice