admin@slurmfrontend:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
slurmpar* up infinite 3 idle slurmnode[1-3]
admin@slurmfrontend:~$
However, if I try to use “srun” to test a job submission, it fails, saying slurmd could not execve the job:
admin@slurmfrontend:~$ srun hostname
srun: error: task 0 launch failed: Slurmd could not execve job
slurmstepd: error: task_g_set_affinity: Operation not permitted
slurmstepd: error: _exec_wait_child_wait_for_parent: failed: No error
admin@slurmfrontend:~$
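As far as I can tell, the task/affinity plugin ultimately comes down to a sched_setaffinity(2) call, so my working theory (unconfirmed) is that the call is being denied inside the container. The first thing I plan to check is whether plain CPU pinning works there at all, and what seccomp/capability state the running slurmd has:

# On slurmnode1: does an unprivileged CPU pin succeed at all?
taskset -c 0 hostname

# Is the container under a seccomp filter, and which capabilities
# does the running slurmd actually hold?
grep Seccomp /proc/self/status
grep Cap /proc/$(pidof slurmd)/status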
If I go to the slurmnode1 container where the job should run and look at the slurmd log, all I see is this:
admin@slurmnode1:/$ sudo cat /var/log/slurmd.log
[2024-06-13T14:58:36.238] CPU frequency setting not configured for this node
[2024-06-13T14:58:36.239] warning: Core limit is only 0 KB
[2024-06-13T14:58:36.239] slurmd version 23.11.7 started
[2024-06-13T14:58:36.243] slurmd started on Thu, 13 Jun 2024 14:58:36 +0000
[2024-06-13T14:58:36.243] CPUs=8 Boards=1 Sockets=8 Cores=1 Threads=1 Memory=47926 TmpDisk=59767 Uptime=71713 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2024-06-13T14:58:47.230] launch task StepId=1.0 request from UID:1000 GID:1000 HOST:172.20.0.2 PORT:50618
[2024-06-13T14:58:47.230] task/affinity: lllp_distribution: JobId=1 implicit auto binding: sockets,one_thread, dist 8192
[2024-06-13T14:58:47.230] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2024-06-13T14:58:47.230] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [1]: mask_cpu,one_thread, 0x01
[2024-06-13T14:58:47.243] [1.0] error: task_g_set_affinity: Operation not permitted
[2024-06-13T14:58:47.243] [1.0] error: _exec_wait_child_wait_for_parent: failed: No error
[2024-06-13T14:58:47.244] [1.0] error: job_manager: exiting abnormally: Slurmd could not execve job
[2024-06-13T14:58:47.247] [1.0] stepd_cleanup: done with step (rc[0xfb4]:Slurmd could not execve job, cleanup_rc[0xfb4]:Slurmd could not execve job)
admin@slurmnode1:/$
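In case it turns up more detail, my next step is to stop the service on the compute node and run slurmd in the foreground with extra verbosity (-D and -v are standard slurmd options; the systemd unit name is whatever the Debian packages install):

# On slurmnode1: run slurmd interactively with maximum verbosity
sudo systemctl stop slurmd
sudo slurmd -D -vvv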
I’ve installed by following the instructions for building and installing the Debian packages and can see that all the daemons are up and running.
I have this slurm.conf (on all nodes):
admin@slurmfrontend:~$ grep -v '#' /etc/slurm/slurm.conf
ClusterName=cluster
SlurmctldHost=slurmmaster
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmdParameters=config_overrides
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
NodeName=slurmnode[1-3] CPUs=8 State=UNKNOWN
PartitionName=slurmpar Nodes=ALL Default=YES MaxTime=INFINITE State=UP
admin@slurmfrontend:~$
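To rule the plugin itself in or out, I’m also planning to temporarily disable task binding altogether (task/none is the stock no-op plugin) and restart the daemons, since as far as I know a plugin change isn’t reliably picked up by a reconfigure alone:

# In /etc/slurm/slurm.conf on all nodes:
#   TaskPlugin=task/none
# then restart so the plugin change is loaded:
sudo systemctl restart slurmctld   # on slurmmaster
sudo systemctl restart slurmd      # on each slurmnode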
And I have this cgroup.conf (on all nodes):
admin@slurmfrontend:~$ grep -v '#' /etc/slurm/cgroup.conf
CgroupPlugin=cgroup/v1
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
admin@slurmfrontend:~$
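Since these are containers, I also want to verify that the cgroup v1 hierarchy the plugin expects is actually what’s mounted inside them; my understanding is that on a cgroup v2 layout the filesystem check below reports cgroup2fs rather than tmpfs:

# On slurmnode1: which cgroup layout does the container see?
stat -fc %T /sys/fs/cgroup
mount | grep cgroup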
Does anyone have any clues about where to look to find out why “srun” can’t run a job, and where the “task_g_set_affinity: Operation not permitted” error may be coming from?
Chris
---------------------------------------------------------------------------------------------------
Christopher W. Harrop voice: (720) 649-0316
NOAA Global Systems Laboratory, R/GSL6 fax: (303) 497-7259
325 Broadway
Boulder, CO 80303