[slurm-users] CPU binding outside of job step allocation

Will Furnass w.furnass at sheffield.ac.uk
Mon Nov 14 16:20:12 UTC 2022


Hi Chris, all,

We've been having similar issues, seemingly since upgrading to Slurm
22.05.x, where job steps in batch jobs submitted from interactive sessions
fail sporadically:

1. User SSHs to a login node.
2. User runs 'srun --pty /bin/bash' to get an interactive session on a
worker node.
3. From that interactive session the user submits a batch job containing
one or more explicit job steps.
4. A job step then _might_ fail with something like:

    srun: error: CPU binding outside of job step allocation, allocated CPUs
are: 0x2.
    srun: error: Task launch for StepId=372.0 failed on node px01: Unable
to satisfy cpu bind request
    srun: error: Application launch failed: Unable to satisfy cpu bind
request
    srun: Job step aborted

This seems to be due to SLURM_CPU_BIND_* env vars being set in the
interactive job, which then (undesirably) propagate to the batch job and
cause problems if the batch job's allocated CPU set conflicts with the
inherited SLURM_CPU_BIND_* values.
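
To see whether a session is affected, one can list the inherited binding
variables from within the interactive shell before submitting (a minimal
check; exactly which variables are set will depend on how srun was
invoked):

    # run inside the 'srun --pty /bin/bash' session
    env | grep '^SLURM_CPU_BIND'

If this prints something like SLURM_CPU_BIND=quiet,mask_cpu:0x2 (the mask
format srun uses), sbatch will hand that value on to the batch job.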

Unsetting those env vars at the top of the job submission script seems to
prevent the issue, but it isn't something we want to recommend to users.
We're also concerned that propagation of other env vars from the
interactive job to the batch job might cause further issues.
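
For completeness, the workaround we've been testing looks like this at the
top of the submission script (a sketch only; the #SBATCH directives and
./hpcc are placeholders standing in for a real job):

    #!/bin/bash
    #SBATCH --job-name=example
    # Unset every inherited variable whose name starts with
    # SLURM_CPU_BIND, so srun recomputes bindings from this
    # job's own allocation rather than the interactive job's:
    for v in "${!SLURM_CPU_BIND@}"; do unset "$v"; done

    srun ./hpcc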

We thought that SLURM_EXPORT_ENV / SBATCH_EXPORT could help here but the
docs for those features say: "Note that SLURM_* variables are always
propagated."
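
In other words, even submitting with an emptied environment along these
lines (a sketch; 'job' here is a submission script such as Chris's below)
would still pass the binding variables through:

    # from within the interactive session
    sbatch --export=NONE job   # per the docs, SLURM_* vars still propagate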

Has anything changed in 22.05 that could explain this?  The only changelog
entries I can spot that might be related are:

 -- Fail srun when using invalid `--cpu-bind` options (e.g.
`--cpu-bind=map_cpu:99` when only 10 cpus are allocated).
 -- `srun --overlap` now allows the step to share all resources (CPUs,
memory, and GRES), where previously `--overlap` only allowed the step to
share CPUs with other steps.

NB this has also been discussed on the Slurm Bugzilla (
https://bugs.schedmd.com/show_bug.cgi?id=14298).

Regards,

Will


On Fri, 10 Jun 2022 at 14:55, Rutledge, Chris <crutledge at renci.org> wrote:

> Hello Everyone,
>
> Having an odd issue with the latest version of Slurm (22.05.0) when
> submitting jobs to the queue while on a compute resource. Not every job
> reproduces the issue, but I've got a few that will. Here's one case that
> consistently errors when trying to launch. I've not been able to
> reproduce the issue when submitting jobs from the login node.
>
> Anyone seen anything like this?
>
> ##############################
> # start interactive session
> ##############################
> [crutledge at ht1 ~]$ /usr/bin/srun --pty /bin/bash -i -l
> [crutledge at largemem-5-1 ~]$ cd hpcc/bin/gpu-6/
>
> ##############################
> # job details
> ##############################
> [crutledge at largemem-5-1 gpu-6]$ cat job
> #!/bin/bash -l
> #
> #SBATCH --job-name=HPCC
> #SBATCH -n 48
> #SBATCH -p gpu
> #SBATCH --mem-per-cpu=3975
>
> module load icc/2022.0.2 env_icc/any mvapich2/2.3.7-intel
>
> srun ./hpcc
>
> mv hpccoutf.txt hpccoutf.txt.${SLURM_JOB_ID}
>
> ##############################
> # submit the job
> ##############################
> [crutledge at largemem-5-1 gpu-6]$ sbatch job
> Submitted batch job 8533
>
> ##############################
> # resulting error
> ##############################
> [crutledge at largemem-5-1 gpu-6]$ cat slurm-8533.out
> Loading icc version 2022.0.2
> Loading compiler-rt version 2022.0.2
> srun: error: CPU binding outside of job step allocation, allocated CPUs
> are: 0x000000000001000000000001.
> srun: error: Task launch for StepId=8533.0 failed on node gpu-5-2: Unable
> to satisfy cpu bind request
> srun: error: Application launch failed: Unable to satisfy cpu bind request
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> slurmstepd: error: *** STEP 8533.0 ON gpu-5-1 CANCELLED AT
> 2022-06-10T09:38:19 ***
> srun: error: gpu-5-1: tasks 0-46: Killed
> mv: cannot stat ‘hpccoutf.txt’: No such file or directory
> [crutledge at largemem-5-1 gpu-6]$
>


-- 
Dr Will Furnass | Research Platforms Engineer
IT Services | University of Sheffield
+44 (0)114 22 29693

