Hi,
I'm new to slurm, but maybe someone can help me:
I'm trying to restrict the CPU usage to the actually requested/allocated resources using cgroup v2.
For this I made the following settings in slurm.conf:

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

And in cgroup.conf:

CgroupPlugin=cgroup/v2
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
AllowedRAMSpace=98
cgroup v2 seems to be active on the compute node:
# mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)

# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory pids
# cat /sys/fs/cgroup/system.slice/cgroup.subtree_control
cpuset cpu io memory pids
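(As a quick sanity check, a minimal snippet like this can be run inside a job step to see which cgroup it landed in and which CPUs the cpuset controller actually allows. This is a sketch assuming the standard cgroup v2 interface files; the exact cgroup path and whether cpuset is delegated to the step depend on the local setup:)

#!/bin/bash
# Run inside a job step (e.g. via srun or sbatch --wrap).
# On a pure cgroup v2 system /proc/self/cgroup contains a single "0::<path>" line.
cg_path=$(sed -n 's/^0:://p' /proc/self/cgroup)
echo "cgroup: ${cg_path}"
# cpuset.cpus.effective lists the CPUs the kernel will actually let this cgroup use.
cat "/sys/fs/cgroup${cg_path}/cpuset.cpus.effective" 2>/dev/null \
    || echo "cpuset.cpus.effective not readable here (path/permissions may differ)"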
Now, when I use sbatch to submit the following test script, the Python script that is started from the batch script utilizes all CPUs (96) at 100% on the allocated node, although I only ask for 4 CPUs (--cpus-per-task=4). I'd expect that the task cannot use more than these 4.
#!/bin/bash
#SBATCH --output=/local/users/appadmin/test-%j.log
#SBATCH --job-name=test
#SBATCH --chdir=/local/users/appadmin
#SBATCH --cpus-per-task=4
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --mem=64gb
#SBATCH --time=4:00:00
#SBATCH --partition=standard
#SBATCH --gpus=0
#SBATCH --export
#SBATCH --get-user-env=L
export PATH=/usr/local/bioinf/jupyterhub/bin:/usr/local/bioinf/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/bioinf/miniforge/condabin
source .bashrc
conda activate test
python test.py
The Python code in test.py is the following, using the cpu_load_generator package from [1]:
#!/usr/bin/env python
import sys
from cpu_load_generator import load_single_core, load_all_cores, from_profile
load_all_cores(duration_s=120, target_load=1) # generates load on all cores
Interestingly, when I use srun to launch an interactive job and run the Python script manually, I see with top that only 4 CPUs are running at 100%. And I also see Python errors thrown when the script tries to start the 5th process (which makes sense):
File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap self.run() File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/cpu_load_generator/_interface.py", line 24, in load_single_core process.cpu_affinity([core_num]) File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/__init__.py", line 867, in cpu_affinity self._proc.cpu_affinity_set(list(set(cpus))) File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/_pslinux.py", line 1714, in wrapper return fun(self, *args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/_pslinux.py", line 2213, in cpu_affinity_set cext.proc_cpu_affinity_set(self.pid, cpus) OSError: [Errno 22] Invalid argument
What am I missing? Why are the CPU resources not restricted when I use sbatch?
Thanks for any input or hint
Dietmar
Hi Dietmar;
I tried this on ${my cluster}, as I switched to cgroups v2 quite recently.
I must say that on my setup it looks like it works as expected; see the grepped stdout from your reproducer below.
I use recent Slurm 23.11.4.
Wild guess: does your build machine have the bpf and dbus devel packages installed? (Both packages can be absent when building Slurm for cgroups v1.)
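(One way to check the binaries you already have installed before rebuilding anything; the plugin directory and file name below are what I'd expect for a stock x86_64 RPM install, so treat them as assumptions and adjust as needed:)

# does the installed Slurm ship a cgroup v2 plugin at all?
ls /usr/lib64/slurm/ | grep -i cgroup

# if cgroup_v2.so is present, was it linked against dbus?
ldd /usr/lib64/slurm/cgroup_v2.so | grep -i dbus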
cheers
josef
[jose@koios1 test_cgroups]$ cat slurm-7177217.out | grep eli
ValueError: CPU number 7 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 4 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 5 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 11 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 9 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 10 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 14 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 8 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 12 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 6 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 13 is not eligible; choose between [0, 1, 2, 3]
ValueError: CPU number 15 is not eligible; choose between [0, 1, 2, 3]
[jose@koios1 test_cgroups]$
On 28. 02. 24 14:28, Dietmar Rieder via slurm-users wrote: ...
Hi,
I'm running Slurm 22.05.11, which is available with OpenHPC 3.x. Do you think an upgrade is needed?
Best Dietmar
On 2/28/24 14:55, Josef Dvoracek via slurm-users wrote: ...
> I'm running Slurm 22.05.11, which is available with OpenHPC 3.x.
> Do you think an upgrade is needed?
I feel that a lot of Slurm operators tend not to use third-party sources of Slurm binaries, as you do not have the build environment fully in your hands.
But before making such a complex decision, perhaps look for the build logs of the Slurm you use (somewhere in the OpenHPC build system?) and check whether it was built with the libraries needed to get cgroups v2 working.
Not having the cgroups v2 dependencies at build time is only one of the possible causes.
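(For the build-time part specifically, a couple of quick checks against the packages or build tree you are using; the package name below is a guess, OpenHPC may name it differently:)

# library dependencies recorded in the slurmd package
# (a libdbus-1 requirement should show up if the cgroup v2 plugin was built with dbus)
rpm -qp --requires slurm-slurmd-*.rpm | grep -i -E 'dbus|bpf'

# or grep the configure log from the build for the relevant checks
grep -i -E 'dbus|bpf' config.log | head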
josef
Hi Josef, hi list,
I have now rebuilt the RPMs from OpenHPC, but using the original sources from version 23.11.4.
The configure command that is generated from the spec is the following:
./configure --build=x86_64-redhat-linux-gnu \
  --host=x86_64-redhat-linux-gnu \
  --program-prefix= \
  --disable-dependency-tracking \
  --prefix=/usr \
  --exec-prefix=/usr \
  --bindir=/usr/bin \
  --sbindir=/usr/sbin \
  --sysconfdir=/etc/slurm \
  --datadir=/usr/share \
  --includedir=/usr/include \
  --libdir=/usr/lib64 \
  --libexecdir=/usr/libexec \
  --localstatedir=/var \
  --sharedstatedir=/var/lib \
  --mandir=/usr/share/man \
  --infodir=/usr/share/info \
  --enable-multiple-slurmd \
  --with-pmix=/opt/ohpc/admin/pmix \
  --with-hwloc=/opt/ohpc/pub/libs/hwloc
(Am I missing something here?)

The configure output shows:
[...]
checking for bpf installation... /usr
checking for dbus-1... yes
[...]
config.log:
dbus_CFLAGS='-I/usr/include/dbus-1.0 -I/usr/lib64/dbus-1.0/include '
dbus_LIBS='-ldbus-1'

confdefs.h:
#define WITH_CGROUP 1
#define HAVE_BPF 1
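So the build seems to have found the cgroup v2 dependencies. To double-check what the running daemons report after installing the rebuilt RPMs, something like this should help (if I understand correctly, recent Slurm versions also print the cgroup.conf settings as part of scontrol show config, but treat that as an assumption):

# what the daemons think is configured
scontrol show config | grep -i -E 'ProctrackType|TaskPlugin'

# recent versions append the cgroup configuration here as well
scontrol show config | grep -i cgroup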
However, I still can't see any CPU limits when I use sbatch to run a batch job.
$ sbatch --time 5 --ntasks-per-node=1 --nodes=1 --cpus-per-task=1 --wrap 'grep Cpus /proc/$$/status'
$ cat slurm-72.out
Cpus_allowed:       ffffffff,ffffffff,ffffffff
Cpus_allowed_list:  0-95
The logs from the head node (leto) and the compute node (apollo-01) show:
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _slurm_rpc_submit_batch_job: JobId=72 InitPrio=1 usec=365
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 72
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 72
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: batch_bind: job 72 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFF
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: batch_bind: job 72 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFF
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: batch_bind: job 72 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFF
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: batch_bind: job 72 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFF
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: Launching batch job 72 for UID 50001
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: Launching batch job 72 for UID 50001
Feb 29 12:55:06 apollo-01 kernel: slurm.epilog.cl (172966): drop_caches: 3
Feb 29 12:55:06 apollo-01 kernel: slurm.epilog.cl (172966): drop_caches: 3
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: sched/backfill: _start_job: Started JobId=72 in standard on apollo-01
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _job_complete: JobId=72 WEXITSTATUS 0
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _job_complete: JobId=72 done
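I should probably also check what the job was actually allocated, not just what the affinity masks say; if the allocation itself already spans all 96 CPUs, the cgroup limits would technically be working and the surprise would be in the allocation:

# while the job is known to the controller
scontrol show job 72 | grep -E 'Partition|NumCPUs|TRES'

# or from accounting, if slurmdbd is running
sacct -j 72 --format=JobID,ReqCPUS,AllocCPUS,Partition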
Best Dietmar
On 2/28/24 16:25, Josef Dvoracek via slurm-users wrote: ...
Hi list,
I finally got it working. I completely overlooked that I had set OverSubscribe=EXCLUSIVE for the partition that I used for testing, stupid me...
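For anyone hitting the same thing: OverSubscribe=EXCLUSIVE gives every job a whole-node allocation, so all 96 CPUs really were allocated to my jobs and the cgroup/affinity limits had nothing left to restrict. A quick way to spot it:

scontrol show partition standard | grep -i oversubscribe

With that removed (the default is OverSubscribe=NO), --cpus-per-task=4 translates into a 4-CPU cpuset again.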
Sorry for the noise, and thanks again for your answers.
Best Dietmar
On 2/29/24 13:19, Dietmar Rieder via slurm-users wrote: ...
Hi Dietmar,
What do you find in the output file of this job:
sbatch --time 5 --cpus-per-task=1 --wrap 'grep Cpus /proc/$$/status'
On our 64-core machines with hyperthreading enabled I see e.g.:
Cpus_allowed:       04000000,00000000,04000000,00000000
Cpus_allowed_list:  58,122
Greetings Hermann
On 2/28/24 14:28, Dietmar Rieder via slurm-users wrote: ...
Hi Hermann,
I get:
Cpus_allowed:       ffffffff,ffffffff,ffffffff
Cpus_allowed_list:  0-95
Best Dietmar
P.S.: Best regards from the CCB
On 2/28/24 15:01, Hermann Schwärzler via slurm-users wrote: ...