[slurm-users] [External] Preemption not working in 20.11

Michael Robbert mrobbert at mines.edu
Fri Feb 26 20:58:53 UTC 2021


We saw something that sounds similar to this. See this bug report: https://bugs.schedmd.com/show_bug.cgi?id=10196

SchedMD never found the root cause. They thought it might have something to do with a timing problem on Prolog scripts, but the thing that fixed it for us was to set GraceTime=0 on our preemptable QoS.
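For anyone wanting to try the same change, a minimal sketch follows. The QoS name `lowprio` is only a placeholder (it is not our real QoS name), and the helper `gracetime_cmd` is a made-up function that just prints the `sacctmgr` command so it can be reviewed before an administrator runs it.

```shell
#!/bin/sh
# QOS_NAME is a placeholder; substitute the name of your own preemptable QoS.
QOS_NAME="${QOS_NAME:-lowprio}"

# Hypothetical helper: print the sacctmgr command that sets GraceTime=0
# on the given QoS, without executing it.
gracetime_cmd() {
    printf 'sacctmgr modify qos where name=%s set GraceTime=0\n' "$1"
}

gracetime_cmd "$QOS_NAME"
```

After running the printed command, the current value can be checked with `sacctmgr show qos where name=<qos> format=Name,PreemptMode,GraceTime`.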

 

Mike Robbert

Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research Computing

Information and Technology Solutions (ITS)

303-273-3786 | mrobbert at mines.edu  


 

From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Prentice Bisbal <pbisbal at pppl.gov>
Reply-To: Slurm User Community List <slurm-users at lists.schedmd.com>
Date: Friday, February 26, 2021 at 12:38
To: "slurm-users at lists.schedmd.com" <slurm-users at lists.schedmd.com>
Subject: [External] [slurm-users] Preemption not working in 20.11

 


We recently upgraded from Slurm 19.05.8 to 20.11.3. In our configuration, we have a partition named 'interruptible' for long-running, low-priority jobs that use checkpoint/restart. Jobs that are preempted there are supposed to be killed and requeued rather than suspended. This configuration had been working without issue for 2+ years.

After the upgrade, this has stopped working. Preempted jobs are killed and not requeued. My slurm.conf file is configured to requeue preempted jobs:

$ grep -i requeue /etc/slurm/slurm.conf 
#JobRequeue=1
PreemptMode=Requeue

And the user's sbatch script included the --requeue option. 
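As an illustration only (this is not the user's actual script), a requeue-friendly job script might look roughly like the sketch below. The partition name comes from the description above; the echo lines stand in for the real checkpoint/restart logic, which is assumed. `SLURM_RESTART_COUNT` is the environment variable Slurm sets for jobs that have been requeued or restarted.

```shell
#!/bin/bash
#SBATCH --partition=interruptible
#SBATCH --requeue

# SLURM_RESTART_COUNT is set by Slurm once a job has been requeued or
# restarted; default to 0 when it is unset (first run, or outside Slurm).
restart_count="${SLURM_RESTART_COUNT:-0}"

if [ "$restart_count" -gt 0 ]; then
    # Placeholder for the application's own restart-from-checkpoint step.
    echo "restart number $restart_count: resuming from last checkpoint"
else
    # Placeholder for a fresh start of the application.
    echo "first run: starting from scratch"
fi
```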

The user reports that the stderr output from his preempted jobs now says:

slurmstepd: error: *** STEP 1075117.0 ON greene002 CANCELLED AT 2021-02-25T16:07:48 ***

In the past it would say PREEMPTED instead of CANCELLED.
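To check a batch of job error files for which of the two messages they contain, something like the following sketch could be used. The function name `classify_step_end` is made up for illustration, and it only looks for the keywords PREEMPTED and CANCELLED; it does not assume anything else about the exact wording of the slurmstepd message.

```shell
#!/bin/sh
# Hypothetical helper: classify a slurmstepd message by the keyword it contains.
classify_step_end() {
    case "$1" in
        *PREEMPTED*) echo "preempted" ;;
        *CANCELLED*) echo "cancelled" ;;
        *)           echo "other" ;;
    esac
}

# Example using the line quoted above.
classify_step_end 'slurmstepd: error: *** STEP 1075117.0 ON greene002 CANCELLED AT 2021-02-25T16:07:48 ***'
```

To apply it to real output, the slurmstepd lines can be pulled from the job's error files with `grep -h 'slurmstepd: error'` and passed to the function one at a time.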


Any ideas what would cause this? I've reported this to Slurm support, and haven't gotten anything back yet, so I figured I'd ask here, too. If this is a bug, I can't be the only one who has experienced this. 

-- 
Prentice 
