[slurm-users] Configless node problems.....
Phill Harvey-Smith
p.harvey-smith at warwick.ac.uk
Fri Mar 17 17:21:47 UTC 2023
Hi all,
In preparation for deployment on a real-world system, I have been trying
things out on a set of virtual machines arranged as a cluster. One of
the things I am trying to implement is configless nodes.
My virtual cluster is currently set up as follows:
frontend, frontback - both run slurmctld & slurmdbd
backend - runs DNS, user management, shared filesystems and mariadb
exec1-exec3 - local partition nodes
execd1-execd3 - dragon partition nodes
execr1-execr3 - remote partition nodes
The cluster is spread across 3 physical machines, all linked by a tinc
VPN. This works without problems in the traditional configuration, with
slurm.conf present on every node: users can submit jobs and they are
executed as requested.
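By "working" I mean that simple submissions along these lines complete
on every partition (partition names as defined in the slurm.conf below):
# quick smoke test against each partition
srun -p local -N1 hostname
srun -p remote -N1 hostname
srun -p dragon -N1 hostname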
All nodes are Rocky Linux 9, slurm 22.05.2.
So to test the configless setup I have done the following:
1) added SlurmctldParameters=enable_configless to slurm.conf on frontend
& frontback, then restarted slurmctld with systemctl restart
slurmctld.service.
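For what it's worth, whether the parameter has actually taken effect can
be confirmed on a controller by querying the running config (standard
scontrol, nothing here is specific to my setup):
# should report SlurmctldParameters = enable_configless
scontrol show config | grep -i SlurmctldParameters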
2) added:
_slurmctld._tcp 3600 IN SRV 10 0 6817 frontend
_slurmctld._tcp 3600 IN SRV 0 0 6817 frontback
to the forward-lookup (host -> IP) zone file of the DNS server on
backend, and restarted it.
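The SRV records can be checked from an exec node with dig; here I'm
assuming the zone is cluster.local, matching the node names in the
slurm.conf below:
# should return both records, e.g. "10 0 6817 frontend.cluster.local."
dig +short -t SRV _slurmctld._tcp.cluster.local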
3) removed /etc/slurm/slurm.conf on exec1 and attempted to restart
slurmd in configless mode. This is where I hit the problem: starting
slurmd fails, and running it in debug mode gives:
[root at exec1 slurm]# slurmd -D -vv
slurmd: debug: Log file re-opened
slurmd: debug: CPUs:2 Boards:1 Sockets:2 CoresPerSocket:1 ThreadsPerCore:1
slurmd: Node reconfigured socket/core boundaries SocketsPerBoard=1:2(hw) CoresPerSocket=2:1(hw)
slurmd: error: cannot read (null)/user.slice/user-0.slice/session-4.scope/cgroup.controllers: No such file or directory
slurmd: error: Couldn't load specified plugin name for cgroup/v2: Plugin init() callback failed
slurmd: error: cannot create cgroup context for cgroup/v2
slurmd: error: Unable to initialize cgroup plugin
slurmd: error: slurmd initialization failed
If I replace slurm.conf, slurmd starts without problems.
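For reference, the configless fetch can also be pointed at a controller
explicitly, which takes the DNS SRV lookup out of the picture
(--conf-server is a standard slurmd option; frontend:6817 matches the
setup above):
# fetch the config directly from frontend, bypassing the SRV records
slurmd -D -vv --conf-server frontend:6817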
Currently in exec1:/etc/slurm I have:
cgroup.conf:
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=yes
ConstrainKmemSpace=no #avoid known Kernel issues
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
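Since the error above complains about a missing cgroup.controllers file,
this is how the unified cgroup v2 hierarchy can be inspected on the node
(Rocky 9 mounts cgroup2 at /sys/fs/cgroup by default, which is also what
CgroupMountpoint points at):
# confirm cgroup v2 is mounted and list the enabled controllers
mount -t cgroup2
cat /sys/fs/cgroup/cgroup.controllers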
plugstack.conf:
required auto_tmpdir.so mount=/tmp mount=/var/tmp
The slurm.conf from the slurmctld machines is below; I've snipped any
commented lines to save space.
# slurm.conf file generated by configurator.html.
ClusterName=cluster
SlurmctldHost=frontend
SlurmctldHost=frontback
SlurmctldParameters=enable_configless
JobSubmitPlugins=lua
MpiDefault=none
PlugStackConfig=/etc/slurm/plugstack.conf
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurmd
SrunPortRange=60001-63000
StateSaveLocation=/usr/local/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityUsageResetPeriod=NONE
PriorityWeightAge=1000
PriorityWeightFairshare=100000
AccountingStorageEnforce=associations,limits
AccountingStorageType=accounting_storage/slurmdbd
JobCompLoc=/var/log/slurm/joblog.txt
JobCompType=jobcomp/filetext
JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
NodeName=exec1.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=exec2.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=exec3.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
PartitionName=local Nodes=exec1.cluster.local,exec2.cluster.local,exec3.cluster.local Default=Yes
NodeName=execr1.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=execr2.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=execr3.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
PartitionName=remote Nodes=execr1.cluster.local,execr2.cluster.local,execr3.cluster.local Default=no
NodeName=execd1.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=execd2.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=execd3.cluster.local CPUs=2 RealMemory=7168 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
PartitionName=dragon Nodes=execd1.cluster.local,execd2.cluster.local,execd3.cluster.local Default=no qos=part_dragon
Anyone have any idea what the problem could be?
Cheers.
Phill.