Hello, I have a test cluster consisting of two nodes, one acting as the controller and the other as a compute node. I followed all the steps in the Slurm documentation and want to run jobs as containers, but I get the following error when running podman run hello-world on the controller node:
time="2024-08-06T12:02:54+02:00" level=warning msg="freezer not supported: openat2 /sys/fs/cgroup/system.slice/slurmstepd.scope/job_332/step_0/user/arlvm6.ara.332.0.0/cgroup.freeze: no such file or directory"
srun: error: arlvm6: task 0: Exited with exit code 1
time="2024-08-06T12:02:54+02:00" level=warning msg="lstat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_332/step_0/user/arlvm6.ara.332.0.0: no such file or directory"
time="2024-08-06T12:02:54+02:00" level=error msg="runc run failed: unable to start container process: unable to apply cgroup configuration: rootless needs no limits + no cgrouppath when no permission is granted for cgroups: mkdir /sys/fs/cgroup/system.slice/slurmstepd.scope/job_332/step_0/user/arlvm6.ara.332.0.0: permission denied"
On the compute node, the path /sys/fs/cgroup/system.slice/slurmstepd.scope/ exists, but it looks like the job_332/step_0/user/arlvm6.ara.332.0.0 subdirectory could not be created under it.
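A minimal sketch of one way to narrow this down on the compute node: walk up from the path in the error message, find the deepest directory that actually exists, and check whether the current user may create entries in it. The helper names deepest_existing and report are hypothetical and not part of Slurm or runc. The script only inspects the filesystem, and the result depends on the user it runs as.

```python
import os
from pathlib import Path


def deepest_existing(path):
    """Return (deepest existing ancestor, list of missing components below it)."""
    p = Path(path)
    missing = []
    # Climb towards the root until an existing directory is found.
    while not p.exists() and p != p.parent:
        missing.insert(0, p.name)
        p = p.parent
    return p, missing


def report(path):
    """Summarise which part of the path exists and whether it is writable."""
    existing, missing = deepest_existing(path)
    # Creating a subdirectory needs write and search permission on the parent.
    writable = os.access(existing, os.W_OK | os.X_OK)
    return {"existing": str(existing), "missing": missing, "writable": writable}


if __name__ == "__main__":
    # Path copied from the runc error message above.
    target = ("/sys/fs/cgroup/system.slice/slurmstepd.scope/"
              "job_332/step_0/user/arlvm6.ara.332.0.0")
    print(report(target))
```

If "writable" is False for the deepest existing directory, that would be consistent with the "mkdir ... permission denied" message, although it does not by itself say which component is expected to delegate the cgroup to the user.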
The cgroup.conf:
CgroupPlugin=cgroup/v2
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
The oci.conf:
EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)="
RunTimeQuery="runc --rootless=true --root=/run/user/1223609544/ state %n.%u.%j.%s.%t"
RunTimeKill="runc --rootless=true --root=/run/user/1223609544/ kill -a %n.%u.%j.%s.%t SIGKILL"
RunTimeDelete="runc --rootless=true --root=/run/user/1223609544/ delete --force %n.%u.%j.%s.%t"
RunTimeRun="runc --rootless=true --root=/run/user/1223609544/ run %n.%u.%j.%s.%t -b %b"
As you can see, I changed the kill command slightly, because without the SIGKILL parameter it could not kill the containers. I tested the OCI runtime again on both the controller and compute nodes, and two points might be helpful to mention. First, the delete command does not work, because once the container is killed there is nothing left to delete, at least in my tests. Second, there are no pause and resume commands in oci.conf, but I tested them anyway and got the same errors about freezer support and cgroup permissions.
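To show how the container ID in the error path relates to the oci.conf patterns, here is a small illustrative sketch of the substitution. Slurm performs this expansion itself; the expand function is hypothetical and only mimics it. It assumes %n is the node name, %u the user name, %j the job ID, %s the step ID and %t the task ID, and the values below (including "ara" as the user name) are inferred from the path in the error message.

```python
def expand(pattern, values):
    """Replace %<letter> placeholders that have a known value; leave others as-is."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%" and i + 1 < len(pattern) and pattern[i + 1] in values:
            out.append(str(values[pattern[i + 1]]))
            i += 2  # skip the '%' and the placeholder letter
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Assumed values, inferred from the cgroup path in the error message.
values = {"n": "arlvm6", "u": "ara", "j": 332, "s": 0, "t": 0}

container_id = expand("%n.%u.%j.%s.%t", values)
kill_cmd = expand(
    "runc --rootless=true --root=/run/user/1223609544/ kill -a %n.%u.%j.%s.%t SIGKILL",
    values,
)

if __name__ == "__main__":
    print(container_id)
    print(kill_cmd)
```

Under these assumptions the expanded ID is arlvm6.ara.332.0.0, which matches the last component of the cgroup path that runc tries to create.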