[slurm-users] [External] Preemption not working in 20.11
Prentice Bisbal
pbisbal at pppl.gov
Mon Mar 1 22:00:29 UTC 2021
Thanks for the info and link to your bug report. Unfortunately, my
GraceTime is already set to zero for that QOS:
$ sacctmgr show qos interruptible format=Name,gracetime
Name GraceTime
---------- ----------
interrupt+ 00:00:00
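
Since `grep -i requeue` also matches commented-out lines such as `#JobRequeue=1`, it can help to look only at the settings that are actually active. The following is a minimal sketch, not something taken from this thread: `active_setting` is a hypothetical helper name, and the path falls back to the /etc/slurm/slurm.conf location quoted below when SLURM_CONF is unset. It does not follow Include directives and only handles a single Parameter=value at the start of a line.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): print the active, uncommented
# value of a slurm.conf parameter, or nothing if it is unset or commented out.
# Usage: active_setting <Parameter> <path-to-slurm.conf>
active_setting() {
    # Strip comments, then compare the text before the first "="
    # case-insensitively; the last matching line wins.
    sed 's/#.*//' "$2" | awk -F= -v key="$1" '
        { k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k) }
        tolower(k) == tolower(key) {
            v = $2; gsub(/^[ \t]+|[ \t]+$/, "", v); val = v
        }
        END { if (val != "") print val }'
}

# Assumed default path; override with SLURM_CONF if the file lives elsewhere.
conf="${SLURM_CONF:-/etc/slurm/slurm.conf}"
if [ -r "$conf" ]; then
    for key in PreemptType PreemptMode JobRequeue; do
        printf '%s=%s\n' "$key" "$(active_setting "$key" "$conf")"
    done
fi
```

On a live cluster, `scontrol show config | grep -i preempt` and `sacctmgr show qos format=Name,Preempt,PreemptMode,GraceTime` should report the values the controller and the accounting database are using, which may differ from the file on disk.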
On 2/26/21 3:58 PM, Michael Robbert wrote:
>
> We saw something that sounds similar to this. See this bug report:
> https://bugs.schedmd.com/show_bug.cgi?id=10196
>
> SchedMD never found the root cause. They thought it might have
> something to do with a timing problem on Prolog scripts, but the thing
> that fixed it for us was to set GraceTime=0 on our preemptable QoS.
>
> Mike Robbert
>
> Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced
> Research Computing
>
> Information and Technology Solutions (ITS)
>
> 303-273-3786 | mrobbert at mines.edu
>
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf
> of Prentice Bisbal <pbisbal at pppl.gov>
> Reply-To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Date: Friday, February 26, 2021 at 12:38
> To: "slurm-users at lists.schedmd.com" <slurm-users at lists.schedmd.com>
> Subject: [External] [slurm-users] Preemption not working in 20.11
>
> We recently upgraded from Slurm 19.05.8 to 20.11.3. In our
> configuration, we have a partition named 'interruptible' for
> long-running, low-priority jobs that use checkpoint/restart. Preempted
> jobs were killed and requeued rather than suspended. This
> configuration had been working without issue for 2+ years.
>
> After the upgrade, this has stopped working. Preempted jobs are killed
> and not requeued. My slurm.conf file is configured to requeue
> preempted jobs:
>
> $ grep -i requeue /etc/slurm/slurm.conf
> #JobRequeue=1
> PreemptMode=Requeue
>
> And the user's sbatch script included the --requeue option.
>
> The user reports that the stderr output from his preempted jobs now says:
>
> slurmstepd: error: *** STEP 1075117.0 ON greene002 CANCELLED AT
> 2021-02-25T16:07:48 ***
>
> In the past, it would say PREEMPTED instead of CANCELLED.
>
> Any ideas what would cause this? I've reported this to Slurm support,
> and haven't gotten anything back yet, so I figured I'd ask here, too.
> If this is a bug, I can't be the only one who has experienced this.
>
> --
> Prentice
--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov