[slurm-users] Federation problems

rapier rapier at psc.edu
Fri Dec 7 15:14:08 MST 2018


Hello all,

I'm relatively new to Slurm, but I've been tasked with looking at how 
Slurm federation might work. I've set up two very small Slurm clusters. 
Both of them seem to work quite well individually. However, when I try 
to set things up for federation, it rapidly breaks down.

From what I've been able to figure out, I need to edit the slurm.conf on 
the remote cluster so that AccountingStorageHost points to the master 
controller host (which is also running slurmdbd).

However, I'm getting this persistent error:

slurmctld: error: Malformed RPC of type PERSIST_RC(1433) received
slurmctld: error: slurm_persist_conn_open: Failed to unpack persistent connection init resp message from 128.182.xx.yy:6819
slurmctld: error: slurmdbd: Sending PersistInit msg: No error

I also get this error when I try to run sacctmgr.

I've looked through the logs and haven't found anything useful so far. 
I'm sure this is a configuration issue, but I'm not sure where to 
start. I have verified that the remote cluster's slurm user and root 
have passwordless access to the MariaDB server. I've even made sure the 
clusters share the same munge.key.
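
One way to double-check that the munge.key copies really are identical is to compare their digests. The following is a minimal sketch, not part of the setup described here: the function names are hypothetical, and the paths in the usage comment are assumptions about where the key files might have been copied.

```python
import hashlib


def key_digest(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so the whole file is never held in memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def keys_match(path_a, path_b):
    """Return True when both files have byte-identical contents."""
    return key_digest(path_a) == key_digest(path_b)


# Hypothetical usage, after copying each controller's key to one machine:
#   keys_match("/tmp/cvm0-munge.key", "/tmp/cvm2-munge.key")
```

Comparing digests avoids printing the key itself, which should stay private to the munge user.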

If anyone has a clue I'd appreciate it.

Thanks,

Chris

I've included the slurm.conf files of the two controllers and the 
slurmdbd.conf below.

[mastercluster slurm.conf]
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=cvm0
NodeName=cvm0
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
ClusterName=masterCluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
AccountingStoreJobComment=YES
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
NodeName=cvm1 NodeHostName=cvm1 NodeAddr= 128.182.xx.xx CPUs=1 RealMemory=1024 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
PartitionName=debug Nodes=cvm1 Default=YES MaxTime=INFINITE State=UP

[slavecluster slurm.conf]
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=cvm2
NodeName=cvm2
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=128.182.xx.yy
ClusterName=slaveCluster
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
NodeName=cvm3 NodeHostName=cvm3 NodeAddr= 128.182.xx.aa CPUs=1 RealMemory=1024 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
NodeName=cvm4 NodeHostName=cvm4 NodeAddr= 128.182.xx.bb CPUs=1 RealMemory=1024 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
PartitionName=debug Nodes=cvm[3-4] Default=YES MaxTime=INFINITE State=UP


[slurmdbd.conf from slave cluster]
#
# Example slurmdbd.conf file.
#
# See the slurmdbd.conf man page for more information.
#
# Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
#ArchiveSteps=yes
#ArchiveScript=
#JobPurge=12
#StepPurge=1
#
# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
DbdAddr=128.182.xx.yy
DbdHost=cvm0.[FQDN]
#DbdPort=7031
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=4
#DefaultQOS=normal,standby
LogFile=/var/log/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
#StorageHost=localhost
#StoragePort=1234
StoragePass=password
StorageUser=slurm
StorageLoc=slurm_acct_db
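
To see at a glance how the two controllers' accounting settings differ, the slurm.conf files above can be reduced to just the relevant keys. This is only a sketch under stated assumptions: the helper name `accounting_settings` is hypothetical, the parsing handles simple `Key=Value` lines rather than the full slurm.conf syntax, and `AccountingStoragePort` does not appear in the posted files; it is included as a guess at what may be worth comparing.

```python
# Keys to pull out of each slurm.conf. AccountingStoragePort is an
# assumption: it is not set in either posted file.
ACCOUNTING_KEYS = (
    "ClusterName",
    "AccountingStorageType",
    "AccountingStorageHost",
    "AccountingStoragePort",
)

# slurm.conf parameter names are case-insensitive, so match on lowercase.
_WANTED = {name.lower(): name for name in ACCOUNTING_KEYS}


def accounting_settings(conf_text):
    """Return the accounting-related settings found in slurm.conf text."""
    found = {}
    for raw_line in conf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        name = _WANTED.get(key.strip().lower())
        if name is not None:
            found[name] = value.strip()
    return found


# Excerpts copied from the two files above.
master = accounting_settings("""
AccountingStorageType=accounting_storage/slurmdbd
ClusterName=masterCluster
""")
remote = accounting_settings("""
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=128.182.xx.yy
ClusterName=slaveCluster
""")

# Print each key side by side; unset keys show as "(unset)".
for name in ACCOUNTING_KEYS:
    print(name, master.get(name, "(unset)"), remote.get(name, "(unset)"))
```

Keys that print as "(unset)" on both sides fall back to whatever defaults the installed Slurm version applies, so they are worth checking against the slurm.conf man page for that version.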



