admin@slurmfrontend:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
slurmpar* up infinite 3 idle slurmnode[1-3]
admin@slurmfrontend:~$
However, if I try to use “srun” to test a job submission, it fails, saying slurmd could not execve the job:
admin@slurmfrontend:~$ srun hostname
srun: error: task 0 launch failed: Slurmd could not execve job
slurmstepd: error: task_g_set_affinity: Operation not permitted
slurmstepd: error: _exec_wait_child_wait_for_parent: failed: No error
admin@slurmfrontend:~$
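As far as I can tell, the task/affinity plugin ultimately comes down to a sched_setaffinity(2) call, so my working theory (unconfirmed) is that the call is being denied inside the container. The first thing I plan to check is whether plain CPU pinning works there at all, and what seccomp/capability state the running slurmd has:

# On slurmnode1: does an unprivileged CPU pin succeed at all?
taskset -c 0 hostname

# Is the container under a seccomp filter, and which capabilities
# does the running slurmd actually hold?
grep Seccomp /proc/self/status
grep Cap /proc/$(pidof slurmd)/status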
If I go to the slurmnode1 container where the job should run and look at the slurmd log, all I see is this:
admin@slurmnode1:/$ sudo cat /var/log/slurmd.log
[2024-06-13T14:58:36.238] CPU frequency setting not configured for this node
[2024-06-13T14:58:36.239] warning: Core limit is only 0 KB
[2024-06-13T14:58:36.239] slurmd version 23.11.7 started
[2024-06-13T14:58:36.243] slurmd started on Thu, 13 Jun 2024 14:58:36 +0000
[2024-06-13T14:58:36.243] CPUs=8 Boards=1 Sockets=8 Cores=1 Threads=1 Memory=47926 TmpDisk=59767 Uptime=71713 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2024-06-13T14:58:47.230] launch task StepId=1.0 request from UID:1000 GID:1000 HOST:172.20.0.2 PORT:50618
[2024-06-13T14:58:47.230] task/affinity: lllp_distribution: JobId=1 implicit auto binding: sockets,one_thread, dist 8192
[2024-06-13T14:58:47.230] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2024-06-13T14:58:47.230] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [1]: mask_cpu,one_thread, 0x01
[2024-06-13T14:58:47.243] [1.0] error: task_g_set_affinity: Operation not permitted
[2024-06-13T14:58:47.243] [1.0] error: _exec_wait_child_wait_for_parent: failed: No error
[2024-06-13T14:58:47.244] [1.0] error: job_manager: exiting abnormally: Slurmd could not execve job
[2024-06-13T14:58:47.247] [1.0] stepd_cleanup: done with step (rc[0xfb4]:Slurmd could not execve job, cleanup_rc[0xfb4]:Slurmd could not execve job)
admin@slurmnode1:/$
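In case it turns up more detail, my next step is to stop the service on the compute node and run slurmd in the foreground with extra verbosity (-D and -v are standard slurmd options; the systemd unit name is whatever the Debian packages install):

# On slurmnode1: run slurmd interactively with maximum verbosity
sudo systemctl stop slurmd
sudo slurmd -D -vvv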
I’ve installed by following the instructions for building and installing the Debian packages and can see that all the daemons are up and running.
I have this slurm.conf (on all nodes):
admin@slurmfrontend:~$ grep -v '#' /etc/slurm/slurm.conf
ClusterName=cluster
SlurmctldHost=slurmmaster
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmdParameters=config_overrides
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
NodeName=slurmnode[1-3] CPUs=8 State=UNKNOWN
PartitionName=slurmpar Nodes=ALL Default=YES MaxTime=INFINITE State=UP
admin@slurmfrontend:~$
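To rule the plugin itself in or out, I’m also planning to temporarily disable task binding altogether (task/none is the stock no-op plugin) and restart the daemons, since as far as I know a plugin change isn’t reliably picked up by a reconfigure alone:

# In /etc/slurm/slurm.conf on all nodes:
#   TaskPlugin=task/none
# then restart so the plugin change is loaded:
sudo systemctl restart slurmctld   # on slurmmaster
sudo systemctl restart slurmd      # on each slurmnode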
And I have this cgroup.conf (on all nodes):
admin@slurmfrontend:~$ grep -v '#' /etc/slurm/cgroup.conf
CgroupPlugin=cgroup/v1
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
admin@slurmfrontend:~$
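Since these are containers, I also want to verify that the cgroup v1 hierarchy the plugin expects is actually what’s mounted inside them; my understanding is that on a cgroup v2 layout the filesystem check below reports cgroup2fs rather than tmpfs:

# On slurmnode1: which cgroup layout does the container see?
stat -fc %T /sys/fs/cgroup
mount | grep cgroup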
Does anyone have any clues about where to look to find out why “srun” can’t run a job, and where the “task_g_set_affinity: Operation not permitted” error may be coming from?
Chris
---------------------------------------------------------------------------------------------------
Christopher W. Harrop voice: (720) 649-0316
NOAA Global Systems Laboratory, R/GSL6 fax: (303) 497-7259
325 Broadway
Boulder, CO 80303