There is a permission problem somewhere, but I don’t know where.
If I run as root, it works:
admin@slurmfrontend:~$ srun hostname
srun: error: task 0 launch failed: Slurmd could not execve job
slurmstepd: error: task_g_set_affinity: Operation not permitted
slurmstepd: error: _exec_wait_child_wait_for_parent: failed: No error
admin@slurmfrontend:~$ sudo srun hostname
slurmnode1
admin@slurmfrontend:~$ sudo srun -N 3 hostname
slurmnode1
slurmnode3
slurmnode2
admin@slurmfrontend:~$
Chris

---------------------------------------------------------------------------------------------------
Christopher W. Harrop                      voice: (720) 649-0316
NOAA Global Systems Laboratory, R/GSL6     fax: (303) 497-7259
325 Broadway
Boulder, CO 80303
I believe I have solved this. I changed the configuration to replace:
TaskPlugin=task/affinity
with:
TaskPlugin=task/none
In my case, the login node, the head node, and all of the compute nodes each run in their own container, and docker compose runs all of those containers to create a containerized Slurm cluster on a single physical host. The failing call in the output above is task_g_set_affinity ("Operation not permitted"), so I think the "TaskPlugin=task/none" setting is required in this kind of setup.
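For anyone who wants to script the same change, here is a minimal sketch of the edit described above. The function name set_task_plugin_none and the /etc/slurm/slurm.conf path are assumptions for illustration, not something taken from this setup; adjust the path to wherever slurm.conf lives in your containers.

```shell
# Hypothetical helper: switch TaskPlugin from task/affinity to task/none
# in the slurm.conf given as $1, keeping a .bak copy of the original file.
set_task_plugin_none() {
    conf="$1"
    # Only rewrite an exact "TaskPlugin=task/affinity" line; other settings are untouched.
    sed -i.bak 's|^TaskPlugin=task/affinity[[:space:]]*$|TaskPlugin=task/none|' "$conf"
    # Print the resulting TaskPlugin line so the change can be checked by eye.
    grep '^TaskPlugin=' "$conf"
}

# Example call (the path is an assumption):
# set_task_plugin_none /etc/slurm/slurm.conf
```

Since slurm.conf is expected to be the same on every node, the edit would need to reach all of the containers, and the Slurm daemons will most likely need a restart before the new plugin setting takes effect.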
If anyone has any other recommendations, please let me know.