Hi Josef, hi list,
I have now rebuilt the RPMs from OpenHPC, but using the original sources from version 23.11.4.
The configure command that is generated from the spec is the following:
./configure --build=x86_64-redhat-linux-gnu \
    --host=x86_64-redhat-linux-gnu \
    --program-prefix= \
    --disable-dependency-tracking \
    --prefix=/usr \
    --exec-prefix=/usr \
    --bindir=/usr/bin \
    --sbindir=/usr/sbin \
    --sysconfdir=/etc/slurm \
    --datadir=/usr/share \
    --includedir=/usr/include \
    --libdir=/usr/lib64 \
    --libexecdir=/usr/libexec \
    --localstatedir=/var \
    --sharedstatedir=/var/lib \
    --mandir=/usr/share/man \
    --infodir=/usr/share/info \
    --enable-multiple-slurmd \
    --with-pmix=/opt/ohpc/admin/pmix \
    --with-hwloc=/opt/ohpc/pub/libs/hwloc
(Am I missing something here?)

The configure output shows:
[...]
checking for bpf installation... /usr
checking for dbus-1... yes
[...]
config.log:

dbus_CFLAGS='-I/usr/include/dbus-1.0 -I/usr/lib64/dbus-1.0/include '
dbus_LIBS='-ldbus-1'
confdefs.h:

#define WITH_CGROUP 1
#define HAVE_BPF 1
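As a sanity check of the resulting packages, I believe something like this on the compute node should confirm that the cgroup v2 plugin was actually built and links against dbus (assuming the default plugin directory /usr/lib64/slurm that follows from the prefix above):

$ ls /usr/lib64/slurm/ | grep cgroup_v2
$ ldd /usr/lib64/slurm/cgroup_v2.so | grep dbus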
However, I still can't see any CPU limits when I use sbatch to run a batch job:
$ sbatch --time 5 --ntasks-per-node=1 --nodes=1 --cpus-per-task=1 --wrap 'grep Cpus /proc/$$/status'
$ cat slurm-72.out
Cpus_allowed:       ffffffff,ffffffff,ffffffff
Cpus_allowed_list:  0-95
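For reference, this is the kind of configuration I understand is needed for core constraining with cgroup v2 (a minimal sketch using standard slurm.conf / cgroup.conf parameters, not a copy of my actual files):

slurm.conf:
  ProctrackType=proctrack/cgroup
  TaskPlugin=task/cgroup,task/affinity

cgroup.conf:
  CgroupPlugin=cgroup/v2
  ConstrainCores=yes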
The logs from the head node (leto) and the compute node (apollo-01) show:
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _slurm_rpc_submit_batch_job: JobId=72 InitPrio=1 usec=365
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 72
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 72
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: batch_bind: job 72 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFF
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: batch_bind: job 72 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFF
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: batch_bind: job 72 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFF
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: batch_bind: job 72 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFF
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: Launching batch job 72 for UID 50001
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: Launching batch job 72 for UID 50001
Feb 29 12:55:06 apollo-01 kernel: slurm.epilog.cl (172966): drop_caches: 3
Feb 29 12:55:06 apollo-01 kernel: slurm.epilog.cl (172966): drop_caches: 3
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: sched/backfill: _start_job: Started JobId=72 in standard on apollo-01
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _job_complete: JobId=72 WEXITSTATUS 0
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _job_complete: JobId=72 done
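In case it helps with the diagnosis, the task-related settings the daemons are actually running with can be listed with plain scontrol (nothing non-standard assumed here):

$ scontrol show config | grep -Ei 'TaskPlugin|ProctrackType|SelectType'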
Best,
Dietmar
On 2/28/24 16:25, Josef Dvoracek via slurm-users wrote:
I'm running slurm 22.05.11, which is available with OpenHPC 3.x. Do you think an upgrade is needed?
I feel that a lot of slurm operators tend not to use 3rd-party sources of slurm binaries, as you do not have the build environment fully in your hands.
But before making such a complex decision, perhaps look for the build logs of the slurm you use (somewhere in the OpenHPC build system?) and check whether it was built with the libraries needed for cgroup v2 to work.
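For example, listing the files of a prebuilt slurmd RPM should at least show whether the cgroup v2 plugin made it into the build (the package name below is just a guess, adjust it to whatever your repository ships):

rpm -qlp slurm-slurmd-ohpc-*.rpm | grep cgroup_v2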
Not having the cgroup v2 dependencies at build time is only one of the possible causes.
josef