[slurm-users] Federation problems

rapier rapier at psc.edu
Fri Dec 7 15:18:49 MST 2018


And nevermind.

I didn't restart munge after ensuring both clusters had the same key.
This specific problem cleared up after I did that...
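
In case it helps anyone, the check-and-restart sequence was roughly the
following (a sketch, assuming systemd-managed services and the default
munge key location):

  # confirm the key is byte-identical on both controllers
  md5sum /etc/munge/munge.key
  # restart munge first, then the Slurm daemons that authenticate through it
  systemctl restart munge
  systemctl restart slurmdbd    # only on the host running slurmdbd
  systemctl restart slurmctld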


Chris

On 12/7/18 5:14 PM, rapier wrote:
> Hello all,
> 
> I'm relatively new to Slurm, but I've been tasked with looking at how Slurm
> federation might work. I've set up two very small Slurm clusters, and both
> of them seem to work quite well individually. However, when I try to set
> things up for federation, it rapidly breaks down.
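> 
> (For reference, my understanding is that once both clusters talk to the
> same slurmdbd, the federation itself is created through sacctmgr; a minimal
> sketch, with "fed1" as a placeholder federation name:)
> 
>   sacctmgr add federation fed1 clusters=masterCluster,slaveCluster
>   sacctmgr show federation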
> 
> From what I've been able to figure out, I need to edit the slurm.conf on
> the remote cluster so that AccountingStorageHost points to the master
> controller host (which is also running slurmdbd).
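> 
> The relevant lines on the remote cluster would look something like this
> (the host is a placeholder; my actual value is in the config further down):
> 
>   AccountingStorageType=accounting_storage/slurmdbd
>   # point at the host running slurmdbd on the master cluster
>   AccountingStorageHost=<master-controller-host>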
> 
> However, I'm getting a persistent error of
> 
> slurmctld: error: Malformed RPC of type PERSIST_RC(1433) received
> slurmctld: error: slurm_persist_conn_open: Failed to unpack persistent connection init resp message from 128.182.xx.yy:6819
> slurmctld: error: slurmdbd: Sending PersistInit msg: No error
> 
> I also get this error when I try to run sacctmgr.
> 
> I've looked through the logs and can't find anything useful so far. I'm
> sure this is a configuration issue, but I'm not sure where to start. I have
> verified that the remote cluster's slurm user and root have passwordless
> access to the MariaDB server, and I've even made sure the clusters share
> the same munge.key.
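> 
> Roughly, those checks looked like this (run from the remote controller;
> cvm0 is the master controller, 6819 the slurmdbd port from the error above):
> 
>   # cross-host munge check: encode on one side, decode on the other
>   munge -n | ssh cvm0 unmunge
>   # confirm the slurmdbd port on the accounting host is reachable
>   nc -zv 128.182.xx.yy 6819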
> 
> If anyone has a clue I'd appreciate it.
> 
> Thanks,
> 
> Chris
> 
> I've included the slurm.conf files of the two controllers and the slurmdbd.conf below.
> 
> [mastercluster slurm.conf]
> # slurm.conf file generated by configurator easy.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> SlurmctldHost=cvm0
> NodeName=cvm0
> #
> #MailProg=/bin/mail
> MpiDefault=none
> #MpiParams=ports=#-#
> ProctrackType=proctrack/cgroup
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmctld.pid
> #SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> #SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmUser=slurm
> #SlurmdUser=root
> StateSaveLocation=/var/spool/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/affinity
> #
> #
> # TIMERS
> #KillWait=30
> #MinJobAge=300
> #SlurmctldTimeout=120
> #SlurmdTimeout=300
> #
> #
> # SCHEDULING
> FastSchedule=1
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core
> #
> #
> # LOGGING AND ACCOUNTING
> AccountingStorageType=accounting_storage/slurmdbd
> ClusterName=masterCluster
> #JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/linux
> JobAcctGatherFrequency=30
> AccountingStoreJobComment=YES
> #SlurmctldDebug=3
> SlurmctldLogFile=/var/log/slurmctld.log
> #SlurmdDebug=3
> SlurmdLogFile=/var/log/slurmd.log
> #
> #
> # COMPUTE NODES
> NodeName=cvm1 NodeHostName=cvm1 NodeAddr=128.182.xx.xx CPUs=1 RealMemory=1024 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
> PartitionName=debug Nodes=cvm1 Default=YES MaxTime=INFINITE State=UP
> 
> [slavecluster slurm.conf]
> # slurm.conf file generated by configurator easy.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> SlurmctldHost=cvm2
> NodeName=cvm2
> #
> #MailProg=/bin/mail
> MpiDefault=none
> #MpiParams=ports=#-#
> ProctrackType=proctrack/cgroup
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmctld.pid
> #SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> #SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmUser=slurm
> #SlurmdUser=root
> StateSaveLocation=/var/spool/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/affinity
> #
> #
> # TIMERS
> #KillWait=30
> #MinJobAge=300
> #SlurmctldTimeout=120
> #SlurmdTimeout=300
> #
> #
> # SCHEDULING
> FastSchedule=1
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core
> #
> #
> # LOGGING AND ACCOUNTING
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageHost=128.182.xx.yy
> ClusterName=slaveCluster
> JobAcctGatherType=jobacct_gather/linux
> JobAcctGatherFrequency=30
> #SlurmctldDebug=3
> SlurmctldLogFile=/var/log/slurmctld.log
> #SlurmdDebug=3
> SlurmdLogFile=/var/log/slurmd.log
> #
> #
> # COMPUTE NODES
> NodeName=cvm3 NodeHostName=cvm3 NodeAddr=128.182.xx.aa CPUs=1 RealMemory=1024 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
> NodeName=cvm4 NodeHostName=cvm4 NodeAddr=128.182.xx.bb CPUs=1 RealMemory=1024 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
> PartitionName=debug Nodes=cvm[3-4] Default=YES MaxTime=INFINITE State=UP
> 
> 
> [slurmdbd.conf from slave cluster]
> #
> # Example slurmdbd.conf file.
> #
> # See the slurmdbd.conf man page for more information.
> #
> # Archive info
> #ArchiveJobs=yes
> #ArchiveDir="/tmp"
> #ArchiveSteps=yes
> #ArchiveScript=
> #JobPurge=12
> #StepPurge=1
> #
> # Authentication info
> AuthType=auth/munge
> #AuthInfo=/var/run/munge/munge.socket.2
> #
> # slurmDBD info
> DbdAddr=128.182.xx.yy
> DbdHost=cvm0.[FQDN]
> #DbdPort=7031
> SlurmUser=slurm
> #MessageTimeout=300
> DebugLevel=4
> #DefaultQOS=normal,standby
> LogFile=/var/log/slurmdbd.log
> PidFile=/var/run/slurmdbd.pid
> #PluginDir=/usr/lib/slurm
> #PrivateData=accounts,users,usage,jobs
> #TrackWCKey=yes
> #
> # Database info
> StorageType=accounting_storage/mysql
> #StorageHost=localhost
> #StoragePort=1234
> StoragePass=password
> StorageUser=slurm
> StorageLoc=slurm_acct_db
> 


