[slurm-users] Slurm not starting

Elisabetta Falivene e.falivene at ilabroma.com
Tue Jan 16 08:32:47 MST 2018


Here is the solution and another (minor) problem!

Investigating in the direction of the pid problem, I found that the
configuration contained

SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid

but the pid files were actually being looked for in /var/run/slurm-llnl, so
in the slurm.conf of the master AND the nodes I changed them to

SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid

and was again able to launch slurmctld on the master and slurmd on the nodes.
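(For the record, a minimal restart-and-verify sequence after a change like
this, assuming the stock Debian jessie units and SlurmUser=slurm as in the
slurm.conf quoted below, would be something like:

    # on the master
    mkdir -p /var/run/slurm-llnl && chown slurm: /var/run/slurm-llnl
    systemctl daemon-reload && systemctl restart slurmctld
    # on each node
    systemctl restart slurmd
    # then, back on the master
    sinfo

Note that /var/run is usually a tmpfs, so the slurm-llnl directory has to be
recreated at every boot; the Debian packages normally do this themselves, but
it is worth checking if the pid files go missing after a reboot.)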

At this point the nodes were all being set to drain automatically, with an
error like

error: Node node01 has low real_memory size (15999 < 16000)

so it was necessary to change, in the slurm.conf (master and nodes),
NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
to
NodeName=node[01-08] CPUs=16 RealMemory=15999 State=UNKNOWN
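(A less error-prone way to pick the RealMemory value is to ask slurmd itself:
on most Slurm versions, "slurmd -C" prints a NodeName line with the hardware
it actually detects, which can be pasted into slurm.conf. On one of these
nodes the output should look roughly like:

    node01:~# slurmd -C
    NodeName=node01 CPUs=16 Boards=1 SocketsPerBoard=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=15999 TmpDisk=40189

A common practice is to configure slightly less than the detected RealMemory,
so that a few MB of kernel-reserved memory can never drain the node.)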

Now, slurm works and the nodes are running. There is only one minor problem:

error: Node node04 has low real_memory size (7984 < 15999)
error: Node node02 has low real_memory size (3944 < 15999)

Two nodes are still being put into the drain state. These nodes suffered
physical damage to some RAM modules and I had to remove them, so slurm thinks
it is not a good idea to use the nodes. Is it possible to make slurm use them
anyway? I know I can run "scontrol update NodeName=node04 State=RESUME" to
put a node back into the idle state, but when the machine is rebooted or the
service is restarted it is set to drain again.
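(A sketch of two possible permanent fixes, untested on this setup: either
give the degraded nodes their own NodeName lines with the memory they really
have,

    NodeName=node[01,03,05-08] CPUs=16 RealMemory=15999 State=UNKNOWN
    NodeName=node02            CPUs=16 RealMemory=3944  State=UNKNOWN
    NodeName=node04            CPUs=16 RealMemory=7984  State=UNKNOWN

or set FastSchedule=2, which tells Slurm to schedule purely on the configured
values and not to drain nodes whose hardware reports less than configured.
The per-node lines are safer here, since with
SelectTypeParameters=CR_CPU_Memory jobs could otherwise be packed onto memory
that node02 and node04 no longer have.)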

Thank you for your help!
b

2018-01-16 13:25 GMT+01:00 Elisabetta Falivene <e.falivene at ilabroma.com>:

>
> It seems like the pidfile in systemd and slurm.conf are different. Check
>> if they are the same and if not adjust the slurm.conf pid files. That
>> should prevent systemd from killing slurm.
>>
> Ehm, sorry, how can I do this?
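(One way to check this, assuming the stock Debian unit files and config path:

    grep -i pidfile /lib/systemd/system/slurmctld.service /lib/systemd/system/slurmd.service
    grep -i pidfile /etc/slurm-llnl/slurm.conf

The PIDFile= lines in the units must point at the same paths as
SlurmctldPidFile/SlurmdPidFile in slurm.conf, which is exactly the mismatch
fixed at the top of this thread.)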
>
>
>
>> On Mon, 15 Jan 2018, 18:24 Elisabetta Falivene, <e.falivene at ilabroma.com>
>> wrote:
>>
>>> The deeper I go into the problem, the worse it seems... but maybe I'm a
>>> step closer to the solution.
>>>
>>> I discovered that munge was disabled on the nodes (my fault; Gennaro
>>> pointed out the problem before, but I had re-enabled it only on the
>>> master). Btw, it's very strange that the wheezy->jessie upgrade disabled
>>> munge on all the nodes and the master...
>>>
>>> Unfortunately, re-enabling munge on the nodes didn't make slurmd start
>>> again.
>>>
>>> Maybe filling in this setting could give me some info about the problem?
>>> #SlurmdLogFile=
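(Yes: pointing the daemons at real log files is a reasonable first debugging
step. A sketch using the customary Debian paths:

    SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
    SlurmdLogFile=/var/log/slurm-llnl/slurmd.log

The directory must exist and be writable by the user each daemon runs as.)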
>>>
>>> Thank you very much for your help. It is very precious to me.
>>> betta
>>>
>>> PS: some tests I made:
>>>
>>> Running on the nodes
>>>
>>> slurmd -Dvvv
>>>
>>> returns
>>>
>>> slurmd: debug2: hwloc_topology_init
>>> slurmd: debug2: hwloc_topology_load
>>> slurmd: Considering each NUMA node as a socket
>>> slurmd: debug:  CPUs:16 Boards:1 Sockets:4 CoresPerSocket:4 ThreadsPerCore:1
>>> slurmd: Node configuration differs from hardware: CPUs=16:16(hw) Boards=1:1(hw) SocketsPerBoard=16:4(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:1(hw)
>>> slurmd: topology NONE plugin loaded
>>> slurmd: Gathering cpu frequency information for 16 cpus
>>> slurmd: debug:  Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
>>> slurmd: debug2: hwloc_topology_init
>>> slurmd: debug2: hwloc_topology_load
>>> slurmd: Considering each NUMA node as a socket
>>> slurmd: debug:  CPUs:16 Boards:1 Sockets:4 CoresPerSocket:4 ThreadsPerCore:1
>>> slurmd: debug:  Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
>>> slurmd: debug:  task/cgroup: now constraining jobs allocated cores
>>> slurmd: task/cgroup: loaded
>>> slurmd: auth plugin for Munge (http://code.google.com/p/munge/) loaded
>>> slurmd: debug:  spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
>>> slurmd: Munge cryptographic signature plugin loaded
>>> slurmd: Warning: Core limit is only 0 KB
>>> slurmd: slurmd version 14.03.9 started
>>> slurmd: Job accounting gather LINUX plugin loaded
>>> slurmd: debug:  job_container none plugin loaded
>>> slurmd: switch NONE plugin loaded
>>> slurmd: slurmd started on Mon, 15 Jan 2018 18:07:17 +0100
>>> slurmd: CPUs=16 Boards=1 Sockets=16 Cores=1 Threads=1 Memory=15999 TmpDisk=40189 Uptime=1254
>>> slurmd: AcctGatherEnergy NONE plugin loaded
>>> slurmd: AcctGatherProfile NONE plugin loaded
>>> slurmd: AcctGatherInfiniband NONE plugin loaded
>>> slurmd: AcctGatherFilesystem NONE plugin loaded
>>> slurmd: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
>>> slurmd: debug2: _slurm_connect failed: Connection refused
>>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>>> slurmd: debug:  Failed to contact primary controller: Connection refused
>>> [the three lines above repeated several more times]
>>> ^Cslurmd: got shutdown request
>>> slurmd: waiting on 1 active threads
>>> slurmd: debug2: _slurm_connect failed: Connection refused
>>> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
>>> slurmd: debug:  Failed to contact primary controller: Connection refused
>>> [the three lines above repeated several more times]
>>> slurmd: error: Unable to register: Unable to contact slurm controller (connect failure)
>>> slurmd: debug:  Unable to register with slurm controller, retrying
>>> slurmd: all threads complete
>>> slurmd: Consumable Resources (CR) Node Selection plugin shutting down ...
>>> slurmd: Munge cryptographic signature plugin unloaded
>>> slurmd: Slurmd shutdown completing
>>>
>>> which maybe is not as bad as it seems, since it may only indicate that
>>> slurmctld is not up on the master, isn't it?
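(Aside: the "Node configuration differs from hardware" line in that log is a
separate, harmless but fixable issue. With only CPUs=16 given, Slurm assumes
16 sockets with 1 core each, while hwloc detects 4 sockets with 4 cores each.
A NodeName line matching the detected topology, assuming all eight nodes are
identical, would be:

    NodeName=node[01-08] Sockets=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=15999 State=UNKNOWN

The repeated "Connection refused" lines, as suspected, only mean that nothing
is listening on the controller's port 6817, i.e. slurmctld is down on the
master.)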
>>>
>>> On the master running
>>>
>>> service slurmctld restart
>>>
>>> returns
>>>
>>> Job for slurmctld.service failed. See 'systemctl status
>>> slurmctld.service' and 'journalctl -xn' for details.
>>>
>>> and
>>>
>>> service slurmctld status
>>>
>>> returns
>>>
>>> slurmctld.service - Slurm controller daemon
>>>    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
>>>    Active: failed (Result: timeout) since Mon 2018-01-15 18:11:20 CET; 44s ago
>>>   Process: 2223 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>>>
>>> slurmctld[2225]: cons_res: select_p_reconfigure
>>> slurmctld[2225]: cons_res: select_p_node_init
>>> slurmctld[2225]: cons_res: preparing for 1 partitions
>>> slurmctld[2225]: Running as primary controller
>>> slurmctld[2225]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>>> systemd[1]: slurmctld.service start operation timed out. Terminating.
>>> slurmctld[2225]: Terminate signal (SIGINT or SIGTERM) received
>>> slurmctld[2225]: Saving all slurm state
>>> systemd[1]: Failed to start Slurm controller daemon.
>>> systemd[1]: Unit slurmctld.service entered failed state.
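(What this status output shows is systemd giving up: it waits for the pid
file named in the unit's PIDFile= setting, never sees it appear at that path,
times out, and terminates a slurmctld that was otherwise starting fine, as
"Running as primary controller" shows. Fixing the pid file paths is the real
cure; if the start timeout itself ever needed raising, a drop-in would be the
usual way, e.g.:

    # /etc/systemd/system/slurmctld.service.d/timeout.conf
    [Service]
    TimeoutStartSec=300

followed by "systemctl daemon-reload".)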
>>>
>>> and
>>> journalctl -xn
>>>
>>> returns no visible error
>>>
>>> -- Logs begin at Mon 2018-01-15 18:04:38 CET, end at Mon 2018-01-15 18:17:33 CET. --
>>> Jan 15 18:16:23 anyone.phys.uniroma1.it slurmctld[2286]: Saving all slurm state
>>> Jan 15 18:16:23 anyone.phys.uniroma1.it systemd[1]: Failed to start Slurm controller daemon.
>>> -- Subject: Unit slurmctld.service has failed
>>> -- Defined-By: systemd
>>> -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
>>> --
>>> -- Unit slurmctld.service has failed.
>>> --
>>> -- The result is failed.
>>> systemd[1]: Unit slurmctld.service entered failed state.
>>> CRON[2312]: pam_unix(cron:session): session opened for user root by (uid=0)
>>> CRON[2313]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
>>> CRON[2312]: pam_unix(cron:session): session closed for user root
>>> dhcpd[1538]: DHCPREQUEST for 192.168.1.101 from c8:60:00:32:c6:c4 via eth1
>>> dhcpd[1538]: DHCPACK on 192.168.1.101 to c8:60:00:32:c6:c4 via eth1
>>> dhcpd[1538]: DHCPREQUEST for 192.168.1.102 from bc:ae:c5:12:97:75 via eth1
>>> dhcpd[1538]: DHCPACK on 192.168.1.102 to bc:ae:c5:12:97:75 via eth1
>>>
>>> 2018-01-15 16:56 GMT+01:00 Carlos Fenoy <minibit at gmail.com>:
>>>
>>>> Hi,
>>>>
>>>> you cannot start slurmd on the headnode. Try running the same command on
>>>> the compute nodes and check the output. If there is any issue, it should
>>>> display the reason.
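(The "slurmd: fatal: Unable to determine this slurmd's NodeName" error quoted
further down is this same point seen from slurmd's side: the host it was
started on, the headnode, does not appear in any NodeName line. A quick check
on any machine is to compare its short hostname against the configured nodes:

    hostname -s
    grep -i '^nodename' /etc/slurm-llnl/slurm.conf

If the hostname is not covered by a NodeName line, slurmd refuses to start.)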
>>>>
>>>> Regards,
>>>> Carlos
>>>>
>>>> On Mon, Jan 15, 2018 at 4:50 PM, Elisabetta Falivene <
>>>> e.falivene at ilabroma.com> wrote:
>>>>
>>>>> On the headnode. (I'm also noticing, and it seems worth mentioning
>>>>> since maybe the problem is the same, that even LDAP is not working as
>>>>> expected, giving an "invalid credential (49)" message. The upgrade to
>>>>> jessie must have touched something that is affecting all my software's
>>>>> sanity :D )
>>>>>
>>>>> Here is my slurm.conf.
>>>>>
>>>>> # slurm.conf file generated by configurator.html.
>>>>> # Put this file on all nodes of your cluster.
>>>>> # See the slurm.conf man page for more information.
>>>>> #
>>>>> ControlMachine=anyone
>>>>> ControlAddr=master
>>>>> #BackupController=
>>>>> #BackupAddr=
>>>>> #
>>>>> AuthType=auth/munge
>>>>> CacheGroups=0
>>>>> #CheckpointType=checkpoint/none
>>>>> CryptoType=crypto/munge
>>>>> #DisableRootJobs=NO
>>>>> #EnforcePartLimits=NO
>>>>> #Epilog=
>>>>> #EpilogSlurmctld=
>>>>> #FirstJobId=1
>>>>> #MaxJobId=999999
>>>>> #GresTypes=
>>>>> #GroupUpdateForce=0
>>>>> #GroupUpdateTime=600
>>>>> #JobCheckpointDir=/var/slurm/checkpoint
>>>>> #JobCredentialPrivateKey=
>>>>> #JobCredentialPublicCertificate=
>>>>> #JobFileAppend=0
>>>>> #JobRequeue=1
>>>>> #JobSubmitPlugins=1
>>>>> #KillOnBadExit=0
>>>>> #Licenses=foo*4,bar
>>>>> #MailProg=/bin/mail
>>>>> #MaxJobCount=5000
>>>>> #MaxStepCount=40000
>>>>> #MaxTasksPerNode=128
>>>>> MpiDefault=openmpi
>>>>> MpiParams=ports=12000-12999
>>>>> #PluginDir=
>>>>> #PlugStackConfig=
>>>>> #PrivateData=jobs
>>>>> ProctrackType=proctrack/cgroup
>>>>> #Prolog=
>>>>> #PrologSlurmctld=
>>>>> #PropagatePrioProcess=0
>>>>> #PropagateResourceLimits=
>>>>> #PropagateResourceLimitsExcept=
>>>>> ReturnToService=2
>>>>> #SallocDefaultCommand=
>>>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>>>> SlurmctldPort=6817
>>>>> SlurmdPidFile=/var/run/slurmd.pid
>>>>> SlurmdPort=6818
>>>>> SlurmdSpoolDir=/tmp/slurmd
>>>>> SlurmUser=slurm
>>>>> #SlurmdUser=root
>>>>> #SrunEpilog=
>>>>> #SrunProlog=
>>>>> StateSaveLocation=/tmp
>>>>> SwitchType=switch/none
>>>>> #TaskEpilog=
>>>>> TaskPlugin=task/cgroup
>>>>> #TaskPluginParam=
>>>>> #TaskProlog=
>>>>> #TopologyPlugin=topology/tree
>>>>> #TmpFs=/tmp
>>>>> #TrackWCKey=no
>>>>> #TreeWidth=
>>>>> #UnkillableStepProgram=
>>>>> #UsePAM=0
>>>>> #
>>>>> #
>>>>> # TIMERS
>>>>> #BatchStartTimeout=10
>>>>> #CompleteWait=0
>>>>> #EpilogMsgTime=2000
>>>>> #GetEnvTimeout=2
>>>>> #HealthCheckInterval=0
>>>>> #HealthCheckProgram=
>>>>> InactiveLimit=0
>>>>> KillWait=60
>>>>> #MessageTimeout=10
>>>>> #ResvOverRun=0
>>>>> MinJobAge=43200
>>>>> #OverTimeLimit=0
>>>>> SlurmctldTimeout=600
>>>>> SlurmdTimeout=600
>>>>> #UnkillableStepTimeout=60
>>>>> #VSizeFactor=0
>>>>> Waittime=0
>>>>> #
>>>>> #
>>>>> # SCHEDULING
>>>>> DefMemPerCPU=1000
>>>>> FastSchedule=1
>>>>> #MaxMemPerCPU=0
>>>>> #SchedulerRootFilter=1
>>>>> #SchedulerTimeSlice=30
>>>>> SchedulerType=sched/backfill
>>>>> #SchedulerPort=
>>>>> SelectType=select/cons_res
>>>>> SelectTypeParameters=CR_CPU_Memory
>>>>> #
>>>>> #
>>>>> # JOB PRIORITY
>>>>> #PriorityType=priority/basic
>>>>> #PriorityDecayHalfLife=
>>>>> #PriorityCalcPeriod=
>>>>> #PriorityFavorSmall=
>>>>> #PriorityMaxAge=
>>>>> #PriorityUsageResetPeriod=
>>>>> #PriorityWeightAge=
>>>>> #PriorityWeightFairshare=
>>>>> #PriorityWeightJobSize=
>>>>> #PriorityWeightPartition=
>>>>> #PriorityWeightQOS=
>>>>> #
>>>>> #
>>>>> # LOGGING AND ACCOUNTING
>>>>> #AccountingStorageEnforce=0
>>>>> #AccountingStorageHost=
>>>>> AccountingStorageLoc=/var/log/slurm-llnl/AccountingStorage.log
>>>>> #AccountingStoragePass=
>>>>> #AccountingStoragePort=
>>>>> AccountingStorageType=accounting_storage/filetxt
>>>>> #AccountingStorageUser=
>>>>> AccountingStoreJobComment=YES
>>>>> ClusterName=cluster
>>>>> #DebugFlags=
>>>>> #JobCompHost=
>>>>> JobCompLoc=/var/log/slurm-llnl/JobComp.log
>>>>> #JobCompPass=
>>>>> #JobCompPort=
>>>>> JobCompType=jobcomp/filetxt
>>>>> #JobCompUser=
>>>>> JobAcctGatherFrequency=60
>>>>> JobAcctGatherType=jobacct_gather/linux
>>>>> SlurmctldDebug=3
>>>>> #SlurmctldLogFile=
>>>>> SlurmdDebug=3
>>>>> #SlurmdLogFile=
>>>>> #SlurmSchedLogFile=
>>>>> #SlurmSchedLogLevel=
>>>>> #
>>>>> #
>>>>> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
>>>>> #SuspendProgram=
>>>>> #ResumeProgram=
>>>>> #SuspendTimeout=
>>>>> #ResumeTimeout=
>>>>> #ResumeRate=
>>>>> #SuspendExcNodes=
>>>>> #SuspendExcParts=
>>>>> #SuspendRate=
>>>>> #SuspendTime=
>>>>> #
>>>>> #
>>>>> # COMPUTE NODES
>>>>> NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
>>>>> PartitionName=batch Nodes=node[01-08] Default=YES MaxTime=INFINITE
>>>>> State=UP
>>>>>
>>>>>
>>>>> 2018-01-15 16:43 GMT+01:00 Carlos Fenoy <minibit at gmail.com>:
>>>>>
>>>>>> Are you trying to start the slurmd in the headnode or a compute node?
>>>>>>
>>>>>> Can you provide the slurm.conf file?
>>>>>>
>>>>>> Regards,
>>>>>> Carlos
>>>>>>
>>>>>> On Mon, Jan 15, 2018 at 4:30 PM, Elisabetta Falivene <
>>>>>> e.falivene at ilabroma.com> wrote:
>>>>>>
>>>>>>> slurmd -Dvvv says
>>>>>>>
>>>>>>> slurmd: fatal: Unable to determine this slurmd's NodeName
>>>>>>>
>>>>>>> b
>>>>>>>
>>>>>>> 2018-01-15 15:58 GMT+01:00 Douglas Jacobsen <dmjacobsen at lbl.gov>:
>>>>>>>
>>>>>>>> The fact that sinfo is responding shows that at least slurmctld is
>>>>>>>> running. Slurmd, on the other hand, is not. Please also get the
>>>>>>>> output of the slurmd log or of running "slurmd -Dvvv".
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> On Jan 15, 2018 06:42, "Elisabetta Falivene" <
>>>>>>>> e.falivene at ilabroma.com> wrote:
>>>>>>>>
>>>>>>>>> > Anyway, I suggest updating the operating system to stretch and
>>>>>>>>> > fixing your configuration under a more recent version of slurm.
>>>>>>>>>
>>>>>>>>> I think I'll soon arrive to that :)
>>>>>>>>> b
>>>>>>>>>
>>>>>>>>> 2018-01-15 14:08 GMT+01:00 Gennaro Oliva <oliva.g at na.icar.cnr.it>:
>>>>>>>>>
>>>>>>>>>> Ciao Elisabetta,
>>>>>>>>>>
>>>>>>>>>> On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene
>>>>>>>>>> wrote:
>>>>>>>>>> > Error messages are not helping me much in guessing what is going
>>>>>>>>>> > on. What should I check to find out what is failing?
>>>>>>>>>>
>>>>>>>>>> check slurmctld.log and slurmd.log, you can find them under
>>>>>>>>>> /var/log/slurm-llnl
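(e.g., assuming the default Debian log locations:

    tail -n 50 /var/log/slurm-llnl/slurmctld.log   # on the master
    tail -n 50 /var/log/slurm-llnl/slurmd.log      # on a compute node

though note that with #SlurmdLogFile unset, as in the slurm.conf above, the
daemons may be logging to syslog instead.)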
>>>>>>>>>>
>>>>>>>>>> > PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>>>>>>>>> > batch        up   infinite      8   unk* node[01-08]
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > Running
>>>>>>>>>> > systemctl status slurmctld.service
>>>>>>>>>> >
>>>>>>>>>> > returns
>>>>>>>>>> >
>>>>>>>>>> > slurmctld.service - Slurm controller daemon
>>>>>>>>>> >    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
>>>>>>>>>> >    Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s ago
>>>>>>>>>> >   Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>>>>>>>>>> >
>>>>>>>>>> > slurmctld[2100]: cons_res: select_p_reconfigure
>>>>>>>>>> > slurmctld[2100]: cons_res: select_p_node_init
>>>>>>>>>> > slurmctld[2100]: cons_res: preparing for 1 partitions
>>>>>>>>>> > slurmctld[2100]: Running as primary controller
>>>>>>>>>> > slurmctld[2100]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>>>>>>>>>> > slurmctld.service start operation timed out. Terminating.
>>>>>>>>>> > Terminate signal (SIGINT or SIGTERM) received
>>>>>>>>>> > slurmctld[2100]: Saving all slurm state
>>>>>>>>>> > Failed to start Slurm controller daemon.
>>>>>>>>>> > Unit slurmctld.service entered failed state.
>>>>>>>>>>
>>>>>>>>>> Do you have a backup controller?
>>>>>>>>>> Check your slurm.conf under:
>>>>>>>>>> /etc/slurm-llnl
>>>>>>>>>>
>>>>>>>>>> Anyway, I suggest updating the operating system to stretch and
>>>>>>>>>> fixing your configuration under a more recent version of slurm.
>>>>>>>>>> Best regards
>>>>>>>>>> --
>>>>>>>>>> Gennaro Oliva
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> --
>>>>>> Carles Fenoy
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> --
>>>> Carles Fenoy
>>>>
>>>
>>>
>

