Hi, I am building a containerized Slurm cluster with Ubuntu 20.04 and have it almost working.
The daemons start, and an “sinfo” command shows compute nodes up and available:
admin@slurmfrontend:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
slurmpar*    up   infinite      3   idle slurmnode[1-3]
admin@slurmfrontend:~$
However, if I try to use “srun” to test a job submission it fails saying it could not execve the job:
admin@slurmfrontend:~$ srun hostname
srun: error: task 0 launch failed: Slurmd could not execve job
slurmstepd: error: task_g_set_affinity: Operation not permitted
slurmstepd: error: _exec_wait_child_wait_for_parent: failed: No error
admin@slurmfrontend:~$
If I go to the slurmnode1 container where the job should run, and look at the slurmd log, all I see is this:
admin@slurmnode1:/$ sudo cat /var/log/slurmd.log
[2024-06-13T14:58:36.238] CPU frequency setting not configured for this node
[2024-06-13T14:58:36.239] warning: Core limit is only 0 KB
[2024-06-13T14:58:36.239] slurmd version 23.11.7 started
[2024-06-13T14:58:36.243] slurmd started on Thu, 13 Jun 2024 14:58:36 +0000
[2024-06-13T14:58:36.243] CPUs=8 Boards=1 Sockets=8 Cores=1 Threads=1 Memory=47926 TmpDisk=59767 Uptime=71713 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2024-06-13T14:58:47.230] launch task StepId=1.0 request from UID:1000 GID:1000 HOST:172.20.0.2 PORT:50618
[2024-06-13T14:58:47.230] task/affinity: lllp_distribution: JobId=1 implicit auto binding: sockets,one_thread, dist 8192
[2024-06-13T14:58:47.230] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2024-06-13T14:58:47.230] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [1]: mask_cpu,one_thread, 0x01
[2024-06-13T14:58:47.243] [1.0] error: task_g_set_affinity: Operation not permitted
[2024-06-13T14:58:47.243] [1.0] error: _exec_wait_child_wait_for_parent: failed: No error
[2024-06-13T14:58:47.244] [1.0] error: job_manager: exiting abnormally: Slurmd could not execve job
[2024-06-13T14:58:47.247] [1.0] stepd_cleanup: done with step (rc[0xfb4]:Slurmd could not execve job, cleanup_rc[0xfb4]:Slurmd could not execve job)
admin@slurmnode1:/$
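One sanity check I was thinking of trying (just my own idea, not something from the Slurm docs) is to set CPU affinity by hand inside the node container, since taskset uses the same sched_setaffinity() call that the task/affinity plugin makes. If this also fails with "Operation not permitted", the restriction presumably comes from the container runtime rather than from Slurm itself:

taskset -c 0 hostname    # pin a trivial command to CPU 0 (exercises sched_setaffinity)
taskset -p $$            # print the current shell's affinity mask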
I installed by following the instructions for building/installing the Debian packages, and I can see that all the daemons are up and running.
I have this slurm.conf (on all nodes):
admin@slurmfrontend:~$ grep -v '#' /etc/slurm/slurm.conf
ClusterName=cluster
SlurmctldHost=slurmmaster
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmdParameters=config_overrides
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
NodeName=slurmnode[1-3] CPUs=8 State=UNKNOWN
PartitionName=slurmpar Nodes=ALL Default=YES MaxTime=INFINITE State=UP
admin@slurmfrontend:~$
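From the log above it looks like the failing call is made by the plugin selected by the TaskPlugin line. One experiment I'm considering (purely to isolate the problem, not as a fix) is to temporarily disable the affinity plugin on all nodes and restart the daemons to see whether jobs then run:

TaskPlugin=task/none    # temporary test only; this gives up CPU binding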
And I have this cgroup.conf (on all nodes):
admin@slurmfrontend:~$ grep -v '#' /etc/slurm/cgroup.conf
CgroupPlugin=cgroup/v1
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
admin@slurmfrontend:~$
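Since cgroup.conf selects cgroup/v1 with the Constrain* options, I assume the v1 controllers have to be visible inside the node containers for that to do anything. Another check on my list (again, just my own debugging idea) is to look at what is actually mounted inside slurmnode1:

mount -t cgroup     # list cgroup v1 controller mounts visible in the container
ls /sys/fs/cgroup/  # see which controllers (cpuset, memory, devices, ...) are present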
Does anyone have any clues about where to look for why “srun” can’t run a job, and where the “task_g_set_affinity: Operation not permitted” error may be coming from?
Chris

---------------------------------------------------------------------------------------------------
Christopher W. Harrop                          voice: (720) 649-0316
NOAA Global Systems Laboratory, R/GSL6         fax:   (303) 497-7259
325 Broadway
Boulder, CO 80303