[slurm-users] Allocation failure when using heterogeneous jobs with sbatch

Thu Jun 2 15:30:35 UTC 2022

Hi all,

I'm trying to run a heterogeneous job with the following Slurm batch script:

#!/usr/bin/env bash
#SBATCH --partition=cpu --time=01:00:00 --nodes=2 --ntasks-per-node=1 --cpus-per-task=2 --mem=8G
#SBATCH hetjob
#SBATCH --partition=gpu --time=01:00:00 --nodes=2 --ntasks-per-node=1 --cpus-per-task=2 --mem=8G --gres=gpu:1

srun \
    --het-group=0 -K sh -c 'echo group 0 $(hostname) $SLURM_PROCID' : \
    --het-group=1 -K sh -c 'echo group 1 $(hostname) $SLURM_PROCID'

It works when I manually run the commands via salloc, but it fails via sbatch with the following message:

srun: error: Allocation failure of 2 nodes: job size of 2, already allocated 2 nodes to previous components.

Am I misunderstanding the sbatch documentation? Is it normal that sbatch and salloc behave differently?

Note: with salloc, the commands run on the slurmctld server, whereas with sbatch the script runs on the first node allocated to the batch job. The Slurm version is 20.11.3.

Best regards,
Nicolas