Hi list,
I finally got it working. I completely overlooked that I had set OverSubscribe=EXCLUSIVE on the partition I was using for testing, stupid me.....
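For anyone who finds this thread later: with OverSubscribe=EXCLUSIVE the job is given the whole node, so task/affinity hands it the full CPU mask no matter what --cpus-per-task says. Roughly how I checked and changed it (the partition name is taken from my logs below, the rest of the PartitionName line is only a sketch, not my exact config):

# show the current setting of the partition
scontrol show partition standard | grep -o 'OverSubscribe=[^ ]*'

# in slurm.conf, drop the EXCLUSIVE setting (other options left as they were)
PartitionName=standard OverSubscribe=NO ...

# then make slurmctld re-read the config
scontrol reconfigure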
Sorry for the noise, and thanks again for your answers.
Best
Dietmar
On 2/29/24 13:19, Dietmar Rieder via slurm-users wrote:
Hi Josef, hi list,
I have now rebuilt the RPMs from OpenHPC, but using the original sources from version 23.11.4.
The configure command that is generated from the spec is the following:
./configure --build=x86_64-redhat-linux-gnu \
    --host=x86_64-redhat-linux-gnu \
    --program-prefix= \
    --disable-dependency-tracking \
    --prefix=/usr \
    --exec-prefix=/usr \
    --bindir=/usr/bin \
    --sbindir=/usr/sbin \
    --sysconfdir=/etc/slurm \
    --datadir=/usr/share \
    --includedir=/usr/include \
    --libdir=/usr/lib64 \
    --libexecdir=/usr/libexec \
    --localstatedir=/var \
    --sharedstatedir=/var/lib \
    --mandir=/usr/share/man \
    --infodir=/usr/share/info \
    --enable-multiple-slurmd \
    --with-pmix=/opt/ohpc/admin/pmix \
    --with-hwloc=/opt/ohpc/pub/libs/hwloc
(Am I missing something here?)
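To double-check that the rebuild actually produced the cgroup v2 plugin, I also looked at the installed plugin directory, roughly like this (the plugin file name and the path under --libdir are assumptions on my side):

# list the cgroup plugins that ended up in the plugin dir (follows --libdir above)
ls /usr/lib64/slurm/ | grep cgroup

# and check that the v2 plugin is linked against dbus
ldd /usr/lib64/slurm/cgroup_v2.so | grep dbus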
The configure output shows:
[...]
checking for bpf installation... /usr
checking for dbus-1... yes
[...]
config.log:
dbus_CFLAGS='-I/usr/include/dbus-1.0 -I/usr/lib64/dbus-1.0/include '
dbus_LIBS='-ldbus-1'

confdefs.h:
#define WITH_CGROUP 1
#define HAVE_BPF 1
However, I still don't see any CPU limits being applied when I run a batch job with sbatch.
$ sbatch --time 5 --ntasks-per-node=1 --nodes=1 --cpus-per-task=1 --wrap 'grep Cpus /proc/$$/status'
$ cat slurm-72.out
Cpus_allowed:       ffffffff,ffffffff,ffffffff
Cpus_allowed_list:  0-95
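For completeness, the cgroup-related settings I am aiming for look roughly like this (a sketch from memory, not a verbatim copy of my files):

# slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

# cgroup.conf
CgroupPlugin=cgroup/v2
ConstrainCores=yes
ConstrainRAMSpace=yes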
The logs on the head node (leto) and the compute node (apollo-01) show:
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _slurm_rpc_submit_batch_job: JobId=72 InitPrio=1 usec=365
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 72
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 72
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: batch_bind: job 72 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFF
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: batch_bind: job 72 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFF
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: batch_bind: job 72 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFF
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: batch_bind: job 72 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFF
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: Launching batch job 72 for UID 50001
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: Launching batch job 72 for UID 50001
Feb 29 12:55:06 apollo-01 kernel: slurm.epilog.cl (172966): drop_caches: 3
Feb 29 12:55:06 apollo-01 kernel: slurm.epilog.cl (172966): drop_caches: 3
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: sched/backfill: _start_job: Started JobId=72 in standard on apollo-01
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _job_complete: JobId=72 WEXITSTATUS 0
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _job_complete: JobId=72 done
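In case it helps anyone comparing setups: a quick way to confirm that a node is actually running a unified cgroup v2 hierarchy is something along these lines (a sketch; the exact controller list varies by distro and kernel config):

# controllers available at the root of the unified hierarchy
cat /sys/fs/cgroup/cgroup.controllers

# filesystem type of the cgroup mount; prints cgroup2fs on a v2-only setup
stat -fc %T /sys/fs/cgroup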
Best
Dietmar
On 2/28/24 16:25, Josef Dvoracek via slurm-users wrote:
> I'm running slurm 22.05.11 which is available with OpenHPC 3.x
> Do you think an upgrade is needed?
I feel that a lot of Slurm operators tend not to use third-party sources of Slurm binaries, since you do not have the build environment fully in your hands.
But before making such a complex decision, perhaps look for the build logs of the Slurm you use (somewhere in the OpenHPC build system?) and check whether it was built with the libraries needed for cgroup v2 to work.
Not having the cgroup v2 dependencies at build time is only one of several possible causes.
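If you can get at the build logs or a build tree, the relevant configure checks are easy to grep for, perhaps something like this (file names are my guess, adjust to what the build system actually keeps):

# in the configure output / build log
grep -E 'checking for (bpf|dbus)' <build log>

# in the generated config.h of the build tree
grep -E 'WITH_CGROUP|HAVE_BPF' config.h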
josef