<div dir="ltr"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><span style="font-family:Arial,Helvetica,sans-serif">Hi Chris, all,</span></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><span style="font-family:Arial,Helvetica,sans-serif"><br></span></div>We've been having<span class="gmail_default" style="font-family:arial,helvetica,sans-serif"> similar</span> issues, seemingly since upgrading to Slurm 22.05.x, where job steps in batch jobs submitted from interactive sessions fail sporadically<span class="gmail_default" style="font-family:arial,helvetica,sans-serif">:</span><br><span class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></span><div><span class="gmail_default" style="font-family:arial,helvetica,sans-serif">1. </span>User SSHs to login node.<div><span class="gmail_default" style="font-family:arial,helvetica,sans-serif">2. </span>User runs 'srun --pty /bin/bash' to get an interactive session on a worker node<br><span class="gmail_default" style="font-family:arial,helvetica,sans-serif"></span><span style="font-family:arial,helvetica,sans-serif"><span class="gmail_default" style="font-family:arial,helvetica,sans-serif">3</span>.</span><span style="font-family:arial,helvetica,sans-serif"> </span><span class="gmail_default" style="font-family:arial,helvetica,sans-serif"></span>From that interactive session the user submits a batch job containing >=1 explicit job step</div><div><span class="gmail_default" style="font-family:arial,helvetica,sans-serif">4. 
</span>The job step then <span class="gmail_default" style="font-family:arial,helvetica,sans-serif">_</span>might<span class="gmail_default" style="font-family:arial,helvetica,sans-serif">_</span> fail with something like:</div><div><br>    srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x2.<br>    srun: error: Task launch for StepId=372.0 failed on node px01: Unable to satisfy cpu bind request<br>    srun: error: Application launch failed: Unable to satisfy cpu bind request<br>    srun: Job step aborted<br><br></div><div>This seems to be due to SLURM_CPU_BIND_* env vars being set in the interactive job, which then (undesirably) propagate to the batch job and cause problems if the job's taskset conflicts with the inherited SLURM_CPU_BIND_* values.<br><br></div><div>Unsetting those env vars at the top of the job submission script seems to prevent the issue from occurring, but isn't something we want to recommend to users.  Also, we're concerned that propagation of other env vars from the interactive job to the batch job might cause other issues.<br><br></div><div>We thought that SLURM_EXPORT_ENV / SBATCH_EXPORT could help here but the docs for those features say<span class="gmail_default" style="font-family:arial,helvetica,sans-serif">: "</span>Note that SLURM_* variables are always propagated.<span class="gmail_default" style="font-family:arial,helvetica,sans-serif">"</span></div><div><span class="gmail_default" style="font-family:arial,helvetica,sans-serif"></span><br></div><div>Has anything changed in 22.05 that could explain this?  The only changelog entries I can spot that might be related are:</div><div><br> -- Fail srun when using invalid `--cpu-bind` options (e.g. 
`--cpu-bind=map_cpu:99` when only 10 cpus are allocated).<br> -- `srun --overlap` now allows the step to share all resources (CPUs, memory, and GRES), where previously `--overlap` only allowed the step to share CPUs with other steps.<br><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><span style="font-family:Arial,Helvetica,sans-serif"><br></span></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><span style="font-family:Arial,Helvetica,sans-serif">NB this has also been discussed on the Slurm Bugzilla (<a href="https://bugs.schedmd.com/show_bug.cgi?id=14298">https://bugs.schedmd.com/show_bug.cgi?id=14298</a>).</span></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><span style="font-family:Arial,Helvetica,sans-serif"><br></span></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><span style="font-family:Arial,Helvetica,sans-serif">Regards,</span></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><span style="font-family:Arial,Helvetica,sans-serif"><br></span></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><span style="font-family:Arial,Helvetica,sans-serif">Will</span></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 10 Jun 2022 at 14:55, Rutledge, Chris <<a href="mailto:crutledge@renci.org">crutledge@renci.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hello Everyone,<br>
<br>
Having an odd issue with the latest version of Slurm (22.05.0) when submitting jobs to the queue while on a compute resource. Not every job reproduces the issue every time, but I've got a few that do. Here's one case that consistently errors when trying to launch. I've not been able to reproduce the issue when submitting jobs from the login node.<br>
<br>
Anyone seen anything like this?<br>
<br>
##############################<br>
# start interactive session<br>
##############################<br>
[crutledge@ht1 ~]$ /usr/bin/srun --pty /bin/bash -i -l<br>
[crutledge@largemem-5-1 ~]$ cd hpcc/bin/gpu-6/<br>
<br>
##############################<br>
# job details<br>
##############################<br>
[crutledge@largemem-5-1 gpu-6]$ cat job <br>
#!/bin/bash -l<br>
#<br>
#SBATCH --job-name=HPCC<br>
#SBATCH -n 48<br>
#SBATCH -p gpu<br>
#SBATCH --mem-per-cpu=3975<br>
<br>
module load icc/2022.0.2 env_icc/any mvapich2/2.3.7-intel<br>
<br>
srun ./hpcc<br>
<br>
mv hpccoutf.txt hpccoutf.txt.${SLURM_JOB_ID}<br>
<br>
##############################<br>
# submit the job<br>
##############################<br>
[crutledge@largemem-5-1 gpu-6]$ sbatch job<br>
Submitted batch job 8533<br>
<br>
##############################<br>
# resulting error<br>
##############################<br>
[crutledge@largemem-5-1 gpu-6]$ cat slurm-8533.out <br>
Loading icc version 2022.0.2<br>
Loading compiler-rt version 2022.0.2<br>
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x000000000001000000000001.<br>
srun: error: Task launch for StepId=8533.0 failed on node gpu-5-2: Unable to satisfy cpu bind request<br>
srun: error: Application launch failed: Unable to satisfy cpu bind request<br>
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.<br>
slurmstepd: error: *** STEP 8533.0 ON gpu-5-1 CANCELLED AT 2022-06-10T09:38:19 ***<br>
srun: error: gpu-5-1: tasks 0-46: Killed<br>
mv: cannot stat ‘hpccoutf.txt’: No such file or directory<br>
[crutledge@largemem-5-1 gpu-6]$<br>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">Dr Will Furnass | Research Platforms Engineer<div>IT Services | University of Sheffield </div><div>+44 (0)114 22 29693 </div></div></div>
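<div><br></div><div>For reference, a minimal sketch of the workaround mentioned above (unsetting the inherited SLURM_CPU_BIND_* variables at the top of the submission script). The function name unset_cpu_bind_vars is hypothetical, and the sketch assumes the offending variables are all exported and share the SLURM_CPU_BIND prefix; it is not an officially recommended fix:</div>

```shell
#!/bin/bash
# Sketch only: drop CPU-binding variables inherited from the interactive
# session before any job step is launched.
# Assumption: every problematic variable is exported and its name starts
# with SLURM_CPU_BIND (e.g. SLURM_CPU_BIND, SLURM_CPU_BIND_LIST).

unset_cpu_bind_vars() {
    # List exported variable names with the SLURM_CPU_BIND prefix
    # and unset each one; other SLURM_* variables are left untouched.
    for _var in $(env | sed -n 's/^\(SLURM_CPU_BIND[A-Za-z0-9_]*\)=.*/\1/p'); do
        unset "$_var"
    done
}

# Call this near the top of the batch script, before the first srun.
unset_cpu_bind_vars
```

<div>In a submission script the call would sit above the first srun line, so that each job step picks up its binding from the batch job's own allocation rather than from the interactive session.</div>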