[slurm-users] Configless node problems.....
Phill Harvey-Smith
p.harvey-smith at warwick.ac.uk
Fri Mar 17 17:21:47 UTC 2023
Hi all,
In preparation for deployment on a real-world system, I have been trying
things out on a set of virtual machines arranged as a cluster. One of
the things I am trying to implement is configless nodes.
My virtual cluster is currently set up as follows:
frontend, frontback - both run slurmctld & slurmdbd
backend - runs DNS, user management, shared filesystems and mariadb
exec1-exec3 - local partition nodes
execd1-execd3 - dragon partition nodes
execr1-execr3 - remote partition nodes
The cluster is spread across 3 physical machines, all linked by a tinc
VPN. This works without problems in the traditional configuration, with
slurm.conf present on every node: users can submit jobs and they are
executed as requested.
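By "working" I mean that simple submissions along these lines complete
on every partition (partition names as defined in the slurm.conf below):
# quick smoke test against each partition
srun -p local -N1 hostname
srun -p remote -N1 hostname
srun -p dragon -N1 hostname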
All nodes are Rocky Linux 9, slurm 22.05.2.
So to test the configless setup I have done the following:
1) added SlurmctldParameters=enable_configless to slurm.conf on frontend
& frontback, then restarted slurmctld with systemctl restart
slurmctld.service.
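For what it's worth, whether the parameter has actually taken effect can
be confirmed on a controller by querying the running config (standard
scontrol, nothing here is specific to my setup):
# should report SlurmctldParameters = enable_configless
scontrol show config | grep -i SlurmctldParameters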
2) added:
_slurmctld._tcp 3600 IN SRV 10 0 6817 frontend
_slurmctld._tcp 3600 IN SRV 0 0 6817 frontback
to the forward-lookup (host -> IP) zone file of the DNS server on
backend, and restarted it.
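The SRV records can be checked from an exec node with dig; here I'm
assuming the zone is cluster.local, matching the node names in the
slurm.conf below:
# should return both records, e.g. "10 0 6817 frontend.cluster.local."
dig +short -t SRV _slurmctld._tcp.cluster.local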
3) removed /etc/slurm/slurm.conf on exec1 and attempted to restart
slurmd in configless mode. This is where I hit the problem: starting
slurmd fails, and running it in debug mode gives:
[root at exec1 slurm]# slurmd -D -vv
slurmd: debug: Log file re-opened
slurmd: debug: CPUs:2 Boards:1 Sockets:2 CoresPerSocket:1 ThreadsPerCore:1
slurmd: Node reconfigured socket/core boundaries SocketsPerBoard=1:2(hw) CoresPerSocket=2:1(hw)
slurmd: error: cannot read (null)/user.slice/user-0.slice/session-4.scope/cgroup.controllers: No such file or directory
slurmd: error: Couldn't load specified plugin name for cgroup/v2: Plugin init() callback failed
slurmd: error: cannot create cgroup context for cgroup/v2
slurmd: error: Unable to initialize cgroup plugin
slurmd: error: slurmd initialization failed
If I replace slurm.conf, slurmd starts without problems.
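For reference, the configless fetch can also be pointed at a controller
explicitly, which takes the DNS SRV lookup out of the picture
(--conf-server is a standard slurmd option; frontend:6817 matches the
setup above):
# fetch the config directly from frontend, bypassing the SRV records
slurmd -D -vv --conf-server frontend:6817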
Currently in exec1:/etc/slurm I have:
cgroup.conf:
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=yes
ConstrainKmemSpace=no #avoid known Kernel issues
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
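Since the error above complains about a missing cgroup.controllers file,
this is how the unified cgroup v2 hierarchy can be inspected on the node
(Rocky 9 mounts cgroup2 at /sys/fs/cgroup by default, which is also what
CgroupMountpoint points at):
# confirm cgroup v2 is mounted and list the enabled controllers
mount -t cgroup2
cat /sys/fs/cgroup/cgroup.controllers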
plugstack.conf:
required auto_tmpdir.so mount=/tmp mount=/var/tmp
The slurm.conf from the slurmctld machines is below; I've snipped any
commented lines to save space.
# slurm.conf file generated by configurator.html.
ClusterName=cluster
SlurmctldHost=frontend
SlurmctldHost=frontback
SlurmctldParameters=enable_configless
JobSubmitPlugins=lua
MpiDefault=none
PlugStackConfig=/etc/slurm/plugstack.conf
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurmd
SrunPortRange=60001-63000
StateSaveLocation=/usr/local/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityUsageResetPeriod=NONE
PriorityWeightAge=1000
PriorityWeightFairshare=100000
AccountingStorageEnforce=associations,limits
AccountingStorageType=accounting_storage/slurmdbd
JobCompLoc=/var/log/slurm/joblog.txt
JobCompType=jobcomp/filetext
JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
NodeName=exec1.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=exec2.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=exec3.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
PartitionName=local Nodes=exec1.cluster.local,exec2.cluster.local,exec3.cluster.local Default=Yes
NodeName=execr1.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=execr2.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=execr3.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
PartitionName=remote Nodes=execr1.cluster.local,execr2.cluster.local,execr3.cluster.local Default=no
NodeName=execd1.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=execd2.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=execd3.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
PartitionName=dragon Nodes=execd1.cluster.local,execd2.cluster.local,execd3.cluster.local Default=no qos=part_dragon
Anyone have any idea what the problem could be?
Cheers.
Phill.