Hi,
At our site we have recently upgraded to Slurm 23.11.5 and are having trouble with MPI jobs that run srun inside an sbatch'ed script.
The cgroup does not appear to be set up correctly for the srun step (step_0).
As an example:
$ cat /sys/fs/cgroup/cpuset/slurm/uid_11000..../job..../cpuset.cpus
0,2-3,68-69,96,98-99,164-165
$ cat /sys/fs/cgroup/cpuset/slurm/uid_11000..../job..../step_0/cpuset.cpus
0,2,68,96,98,164
The sbatch job is allocated a range of CPUs in its cgroup; however, when step_0 is run, only some of those CPUs are present in the step's cgroup.
I have noticed that it is always the remainder of each range that is missing, e.g. for 2-5 only 2 is included and 3, 4 and 5 are missing.
This also only happens if there are multiple ranges of CPUs in the allocation, e.g. 1-12 on its own is fine, but 1-12,15-20 results in only 1,15.
The batch job itself seems fine, with step_batch and step_extern both being allocated correctly.
This causes numerous issues for MPI jobs, as their tasks end up overloading the few CPUs left in the step's cgroup.
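In case it helps with reproducing this, a minimal check from inside a job would be something along these lines (the task counts here are arbitrary and the cgroup paths follow the v1 layout shown above):

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2

# CPUs the whole job was given
cat /sys/fs/cgroup/cpuset/slurm/uid_${UID}/job_${SLURM_JOB_ID}/cpuset.cpus

# CPUs the step actually ends up with, plus the affinity each task sees
srun bash -c 'cat /sys/fs/cgroup/cpuset/slurm/uid_${UID}/job_${SLURM_JOB_ID}/step_${SLURM_STEP_ID}/cpuset.cpus; grep Cpus_allowed_list /proc/self/status'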
We run our nodes with threading (SMT) enabled on the CPUs, and with the cgroup and affinity task plugins.
I have attached our slurm.conf to show our settings.
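For reference, the task-binding related lines are of this general form (the values here are only illustrative; the attachment has our exact settings):

# Task binding via both plugins, as mentioned above
TaskPlugin=task/affinity,task/cgroup
# Process tracking via cgroups (illustrative)
ProctrackType=proctrack/cgroup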
Our /etc/slurm/cgroup.conf is
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
We have turned on logging at the debug2 level but haven't yet found anything useful; I'm happy to take suggestions on what to look for.
Is anyone able to provide any advice on where to go next to try to identify the issue?
Regards,
Ashley Wright