[slurm-users] Slurm not starting

Carlos Fenoy minibit at gmail.com
Mon Jan 15 10:32:12 MST 2018


It seems the PID file paths in the systemd unit and in slurm.conf are different. Check
whether they are the same and, if not, adjust the PID file settings in slurm.conf. That
should prevent systemd from killing Slurm.
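For concreteness, a minimal sketch of that comparison (the unit-file path and slurm.conf location below are Debian slurm-llnl defaults and may differ on your system):

```shell
#!/bin/bash
# Extract a KEY=value path from a file; used to compare the PID file path
# declared in a systemd unit with the one declared in slurm.conf.
pidfile_from() {  # usage: pidfile_from FILE KEY
    grep -m1 "^$2=" "$1" | cut -d= -f2
}

# Example (assumed default paths, adjust to your install):
#   pidfile_from /lib/systemd/system/slurmctld.service PIDFile
#   pidfile_from /etc/slurm-llnl/slurm.conf SlurmctldPidFile
# If the two values differ, systemd never sees the daemon's PID and
# eventually times out and kills it.
```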

On Mon, 15 Jan 2018, 18:24 Elisabetta Falivene, <e.falivene at ilabroma.com>
wrote:

> The deeper I go into the problem, the worse it seems... but maybe I'm a
> step closer to the solution.
>
> I discovered that munge was disabled on the nodes (my fault, Gennaro
> pointed out the problem before, but I enabled it back only on the master).
> Btw, it's very strange that the wheezy->jessie upgrade disabled munge on
> all nodes and master...
>
> Unfortunately, re-enabling munge on the nodes didn't make slurmd start
> again.
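For reference, a minimal sketch of a per-node munge health check (standard munge and systemd commands; running it over ssh against node01..node08 is an assumption based on the node names in the slurm.conf below):

```shell
#!/bin/bash
# Minimal munge health check for one host: the daemon must be active and a
# freshly minted credential must round-trip through unmunge.
check_munge() {
    systemctl is-active --quiet munge || { echo "munge not running"; return 1; }
    munge -n | unmunge > /dev/null   || { echo "credential check failed"; return 1; }
    echo "munge OK"
}
# Run it locally on each node, e.g.:
#   for n in node0{1..8}; do ssh "$n" "$(declare -f check_munge); check_munge"; done
```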
>
> Maybe setting this option could give me some info about the problem?
> #SlurmdLogFile=
>
> Thank you very much for your help. It is very precious to me.
> betta
>
> PS: some tests I made:
>
> Running on the nodes
>
> slurmd -Dvvv
>
> returns
>
> slurmd: debug2: hwloc_topology_init
> slurmd: debug2: hwloc_topology_load
> slurmd: Considering each NUMA node as a socket
> slurmd: debug:  CPUs:16 Boards:1 Sockets:4 CoresPerSocket:4 ThreadsPerCore:1
> slurmd: Node configuration differs from hardware: CPUs=16:16(hw) Boards=1:1(hw) SocketsPerBoard=16:4(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:1(hw)
> slurmd: topology NONE plugin loaded
> slurmd: Gathering cpu frequency information for 16 cpus
> slurmd: debug:  Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
> slurmd: debug2: hwloc_topology_init
> slurmd: debug2: hwloc_topology_load
> slurmd: Considering each NUMA node as a socket
> slurmd: debug:  CPUs:16 Boards:1 Sockets:4 CoresPerSocket:4 ThreadsPerCore:1
> slurmd: debug:  Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
> slurmd: debug:  task/cgroup: now constraining jobs allocated cores
> slurmd: task/cgroup: loaded
> slurmd: auth plugin for Munge (http://code.google.com/p/munge/) loaded
> slurmd: debug:  spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
> slurmd: Munge cryptographic signature plugin loaded
> slurmd: Warning: Core limit is only 0 KB
> slurmd: slurmd version 14.03.9 started
> slurmd: Job accounting gather LINUX plugin loaded
> slurmd: debug:  job_container none plugin loaded
> slurmd: switch NONE plugin loaded
> slurmd: slurmd started on Mon, 15 Jan 2018 18:07:17 +0100
> slurmd: CPUs=16 Boards=1 Sockets=16 Cores=1 Threads=1 Memory=15999 TmpDisk=40189 Uptime=1254
> slurmd: AcctGatherEnergy NONE plugin loaded
> slurmd: AcctGatherProfile NONE plugin loaded
> slurmd: AcctGatherInfiniband NONE plugin loaded
> slurmd: AcctGatherFilesystem NONE plugin loaded
> slurmd: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
> slurmd: debug2: _slurm_connect failed: Connection refused
> slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
> slurmd: debug:  Failed to contact primary controller: Connection refused
> [the three lines above repeat while slurmd retries]
> ^Cslurmd: got shutdown request
> slurmd: waiting on 1 active threads
> [more connection-refused retries]
> slurmd: error: Unable to register: Unable to contact slurm controller (connect failure)
> slurmd: debug:  Unable to register with slurm controller, retrying
> slurmd: all threads complete
> slurmd: Consumable Resources (CR) Node Selection plugin shutting down ...
> slurmd: Munge cryptographic signature plugin unloaded
> slurmd: Slurmd shutdown completing
>
> which is maybe not as bad as it seems, since it may only mean that
> slurmctld is not up on the master, right?
>
> On the master running
>
> service slurmctld restart
>
> returns
>
> Job for slurmctld.service failed. See 'systemctl status slurmctld.service'
> and 'journalctl -xn' for details.
>
> and
>
> service slurmctld status
>
> returns
>
> slurmctld.service - Slurm controller daemon
>    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
>    Active: failed (Result: timeout) since Mon 2018-01-15 18:11:20 CET; 44s ago
>   Process: 2223 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>
> slurmctld[2225]: cons_res: select_p_reconfigure
> slurmctld[2225]: cons_res: select_p_node_init
> slurmctld[2225]: cons_res: preparing for 1 partitions
> slurmctld[2225]: Running as primary controller
> slurmctld[2225]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
> systemd[1]: slurmctld.service start operation timed out. Terminating.
> slurmctld[2225]: Terminate signal (SIGINT or SIGTERM) received
> slurmctld[2225]: Saving all slurm state
> systemd[1]: Failed to start Slurm controller daemon.
> systemd[1]: Unit slurmctld.service entered failed state.
>
> and
> journalctl -xn
>
> returns no visible error
>
> -- Logs begin at Mon 2018-01-15 18:04:38 CET, end at Mon 2018-01-15 18:17:33 CET. --
> Jan 15 18:16:23 anyone.phys.uniroma1.it slurmctld[2286]: Saving all slurm state
> Jan 15 18:16:23 anyone.phys.uniroma1.it systemd[1]: Failed to start Slurm controller daemon.
> -- Subject: Unit slurmctld.service has failed
> -- Defined-By: systemd
> -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
> --
> -- Unit slurmctld.service has failed.
> --
> -- The result is failed.
> systemd[1]: Unit slurmctld.service entered failed state.
> CRON[2312]: pam_unix(cron:session): session opened for user root by (uid=0)
> CRON[2313]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
> CRON[2312]: pam_unix(cron:session): session closed for user root
> dhcpd[1538]: DHCPREQUEST for 192.168.1.101 from c8:60:00:32:c6:c4 via eth1
> dhcpd[1538]: DHCPACK on 192.168.1.101 to c8:60:00:32:c6:c4 via eth1
> dhcpd[1538]: DHCPREQUEST for 192.168.1.102 from bc:ae:c5:12:97:75 via eth1
> dhcpd[1538]: DHCPACK on 192.168.1.102 to bc:ae:c5:12:97:75 via eth1
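[Editor's note] One pattern that produces exactly this trace (slurmctld starts and runs as primary controller, then systemd times out and terminates a healthy daemon) is the PID file mismatch Carlos describes above: slurmctld forks and writes the path given by SlurmctldPidFile in slurm.conf, while systemd waits on the path named in the unit's PIDFile. A sketch of a drop-in keeping the two in agreement (the drop-in path is conventional; the PID path must equal SlurmctldPidFile in slurm.conf):

```
# /etc/systemd/system/slurmctld.service.d/pidfile.conf  (hypothetical drop-in)
[Service]
PIDFile=/var/run/slurmctld.pid
```

followed by systemctl daemon-reload before the next restart attempt.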
>
> 2018-01-15 16:56 GMT+01:00 Carlos Fenoy <minibit at gmail.com>:
>
>> Hi,
>>
>> you cannot start slurmd on the headnode. Try running the same
>> command on the compute nodes and check the output. If there is any issue, it
>> should display the reason.
>>
>> Regards,
>> Carlos
>>
>> On Mon, Jan 15, 2018 at 4:50 PM, Elisabetta Falivene <
>> e.falivene at ilabroma.com> wrote:
>>
>>> On the headnode. (I'm also noticing, and it seems worth mentioning since
>>> maybe the problem is the same, that even LDAP is not working as expected,
>>> giving the message "invalid credential (49)", which is given when there are
>>> problems of this type. The upgrade to jessie must have touched something
>>> that is affecting the sanity of all my software :D )
>>>
>>> Here is my slurm.conf.
>>>
>>> # slurm.conf file generated by configurator.html.
>>> # Put this file on all nodes of your cluster.
>>> # See the slurm.conf man page for more information.
>>> #
>>> ControlMachine=anyone
>>> ControlAddr=master
>>> #BackupController=
>>> #BackupAddr=
>>> #
>>> AuthType=auth/munge
>>> CacheGroups=0
>>> #CheckpointType=checkpoint/none
>>> CryptoType=crypto/munge
>>> #DisableRootJobs=NO
>>> #EnforcePartLimits=NO
>>> #Epilog=
>>> #EpilogSlurmctld=
>>> #FirstJobId=1
>>> #MaxJobId=999999
>>> #GresTypes=
>>> #GroupUpdateForce=0
>>> #GroupUpdateTime=600
>>> #JobCheckpointDir=/var/slurm/checkpoint
>>> #JobCredentialPrivateKey=
>>> #JobCredentialPublicCertificate=
>>> #JobFileAppend=0
>>> #JobRequeue=1
>>> #JobSubmitPlugins=1
>>> #KillOnBadExit=0
>>> #Licenses=foo*4,bar
>>> #MailProg=/bin/mail
>>> #MaxJobCount=5000
>>> #MaxStepCount=40000
>>> #MaxTasksPerNode=128
>>> MpiDefault=openmpi
>>> MpiParams=ports=12000-12999
>>> #PluginDir=
>>> #PlugStackConfig=
>>> #PrivateData=jobs
>>> ProctrackType=proctrack/cgroup
>>> #Prolog=
>>> #PrologSlurmctld=
>>> #PropagatePrioProcess=0
>>> #PropagateResourceLimits=
>>> #PropagateResourceLimitsExcept=
>>> ReturnToService=2
>>> #SallocDefaultCommand=
>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>> SlurmctldPort=6817
>>> SlurmdPidFile=/var/run/slurmd.pid
>>> SlurmdPort=6818
>>> SlurmdSpoolDir=/tmp/slurmd
>>> SlurmUser=slurm
>>> #SlurmdUser=root
>>> #SrunEpilog=
>>> #SrunProlog=
>>> StateSaveLocation=/tmp
>>> SwitchType=switch/none
>>> #TaskEpilog=
>>> TaskPlugin=task/cgroup
>>> #TaskPluginParam=
>>> #TaskProlog=
>>> #TopologyPlugin=topology/tree
>>> #TmpFs=/tmp
>>> #TrackWCKey=no
>>> #TreeWidth=
>>> #UnkillableStepProgram=
>>> #UsePAM=0
>>> #
>>> #
>>> # TIMERS
>>> #BatchStartTimeout=10
>>> #CompleteWait=0
>>> #EpilogMsgTime=2000
>>> #GetEnvTimeout=2
>>> #HealthCheckInterval=0
>>> #HealthCheckProgram=
>>> InactiveLimit=0
>>> KillWait=60
>>> #MessageTimeout=10
>>> #ResvOverRun=0
>>> MinJobAge=43200
>>> #OverTimeLimit=0
>>> SlurmctldTimeout=600
>>> SlurmdTimeout=600
>>> #UnkillableStepTimeout=60
>>> #VSizeFactor=0
>>> Waittime=0
>>> #
>>> #
>>> # SCHEDULING
>>> DefMemPerCPU=1000
>>> FastSchedule=1
>>> #MaxMemPerCPU=0
>>> #SchedulerRootFilter=1
>>> #SchedulerTimeSlice=30
>>> SchedulerType=sched/backfill
>>> #SchedulerPort=
>>> SelectType=select/cons_res
>>> SelectTypeParameters=CR_CPU_Memory
>>> #
>>> #
>>> # JOB PRIORITY
>>> #PriorityType=priority/basic
>>> #PriorityDecayHalfLife=
>>> #PriorityCalcPeriod=
>>> #PriorityFavorSmall=
>>> #PriorityMaxAge=
>>> #PriorityUsageResetPeriod=
>>> #PriorityWeightAge=
>>> #PriorityWeightFairshare=
>>> #PriorityWeightJobSize=
>>> #PriorityWeightPartition=
>>> #PriorityWeightQOS=
>>> #
>>> #
>>> # LOGGING AND ACCOUNTING
>>> #AccountingStorageEnforce=0
>>> #AccountingStorageHost=
>>> AccountingStorageLoc=/var/log/slurm-llnl/AccountingStorage.log
>>> #AccountingStoragePass=
>>> #AccountingStoragePort=
>>> AccountingStorageType=accounting_storage/filetxt
>>> #AccountingStorageUser=
>>> AccountingStoreJobComment=YES
>>> ClusterName=cluster
>>> #DebugFlags=
>>> #JobCompHost=
>>> JobCompLoc=/var/log/slurm-llnl/JobComp.log
>>> #JobCompPass=
>>> #JobCompPort=
>>> JobCompType=jobcomp/filetxt
>>> #JobCompUser=
>>> JobAcctGatherFrequency=60
>>> JobAcctGatherType=jobacct_gather/linux
>>> SlurmctldDebug=3
>>> #SlurmctldLogFile=
>>> SlurmdDebug=3
>>> #SlurmdLogFile=
>>> #SlurmSchedLogFile=
>>> #SlurmSchedLogLevel=
>>> #
>>> #
>>> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
>>> #SuspendProgram=
>>> #ResumeProgram=
>>> #SuspendTimeout=
>>> #ResumeTimeout=
>>> #ResumeRate=
>>> #SuspendExcNodes=
>>> #SuspendExcParts=
>>> #SuspendRate=
>>> #SuspendTime=
>>> #
>>> #
>>> # COMPUTE NODES
>>> NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
>>> PartitionName=batch Nodes=node[01-08] Default=YES MaxTime=INFINITE
>>> State=UP
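[Editor's note] As an aside, the slurmd log earlier reports "Node configuration differs from hardware: ... SocketsPerBoard=16:4(hw) CoresPerSocket=1:4(hw)". A sketch of a node definition matching what slurmd actually detected (topology values taken from that log; RealMemory kept from the existing config):

```
NodeName=node[01-08] Sockets=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=16000 State=UNKNOWN
```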
>>>
>>>
>>> 2018-01-15 16:43 GMT+01:00 Carlos Fenoy <minibit at gmail.com>:
>>>
>>>> Are you trying to start the slurmd in the headnode or a compute node?
>>>>
>>>> Can you provide the slurm.conf file?
>>>>
>>>> Regards,
>>>> Carlos
>>>>
>>>> On Mon, Jan 15, 2018 at 4:30 PM, Elisabetta Falivene <
>>>> e.falivene at ilabroma.com> wrote:
>>>>
>>>>> slurmd -Dvvv says
>>>>>
>>>>> slurmd: fatal: Unable to determine this slurmd's NodeName
>>>>>
>>>>> b
>>>>>
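[Editor's note] A sketch of why this fatal error appears on the headnode: slurmd matches the machine's short hostname against the NodeName entries in slurm.conf (node[01-08] here), and the headnode "anyone" matches none of them. The helper below only illustrates that matching; it is not Slurm code:

```shell
#!/bin/bash
# Illustrative only: mimic slurmd's name lookup against the node[01-08]
# entries from this slurm.conf. slurmd itself fails with the fatal error
# above when no entry matches its hostname.
matches_nodename() {
    case "$1" in
        node0[1-8]) echo "defined in slurm.conf" ;;
        *)          echo "no NodeName matches $1" ;;
    esac
}
# On a compute node, a name can also be forced for a one-off test:
#   slurmd -Dvvv -N node01
```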
>>>>> 2018-01-15 15:58 GMT+01:00 Douglas Jacobsen <dmjacobsen at lbl.gov>:
>>>>>
>>>>>> The fact that sinfo is responding shows that at least slurmctld is
>>>>>> running. Slurmd, on the other hand, is not. Please also get the output
>>>>>> of the slurmd log or of running "slurmd -Dvvv".
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> On Jan 15, 2018 06:42, "Elisabetta Falivene" <e.falivene at ilabroma.com>
>>>>>> wrote:
>>>>>>
>>>>>>> > Anyway, I suggest upgrading the operating system to stretch and
>>>>>>> > fixing your configuration under a more recent version of Slurm.
>>>>>>>
>>>>>>> I think I'll soon arrive to that :)
>>>>>>> b
>>>>>>>
>>>>>>> 2018-01-15 14:08 GMT+01:00 Gennaro Oliva <oliva.g at na.icar.cnr.it>:
>>>>>>>
>>>>>>>> Ciao Elisabetta,
>>>>>>>>
>>>>>>>> On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wrote:
>>>>>>>> > Error messages are not much helping me in guessing what is going
>>>>>>>> on. What
>>>>>>>> > should I check to get what is failing?
>>>>>>>>
>>>>>>>> check slurmctld.log and slurmd.log, you can find them under
>>>>>>>> /var/log/slurm-llnl
>>>>>>>>
>>>>>>>> > *PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST*
>>>>>>>> > *batch*       up   infinite      8   unk* node[01-08]*
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Running
>>>>>>>> > *systemctl status slurmctld.service*
>>>>>>>> >
>>>>>>>> > returns
>>>>>>>> >
>>>>>>>> > *slurmctld.service - Slurm controller daemon*
>>>>>>>> > *   Loaded: loaded (/lib/systemd/system/slurmctld.service;
>>>>>>>> enabled)*
>>>>>>>> > *   Active: failed (Result: timeout) since Mon 2018-01-15
>>>>>>>> 13:03:39 CET; 41s
>>>>>>>> > ago*
>>>>>>>> > *  Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
>>>>>>>> > (code=exited, status=0/SUCCESS)*
>>>>>>>> >
>>>>>>>> > * slurmctld[2100]: cons_res: select_p_reconfigure*
>>>>>>>> > * slurmctld[2100]: cons_res: select_p_node_init*
>>>>>>>> > * slurmctld[2100]: cons_res: preparing for 1 partitions*
>>>>>>>> > * slurmctld[2100]: Running as primary controller*
>>>>>>>> > * slurmctld[2100]:
>>>>>>>> >
>>>>>>>> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0*
>>>>>>>> > * slurmctld.service start operation timed out. Terminating.*
>>>>>>>> > *Terminate signal (SIGINT or SIGTERM) received*
>>>>>>>> > * slurmctld[2100]: Saving all slurm state*
>>>>>>>> > * Failed to start Slurm controller daemon.*
>>>>>>>> > * Unit slurmctld.service entered failed state.*
>>>>>>>>
>>>>>>>> Do you have a backup controller?
>>>>>>>> Check your slurm.conf under:
>>>>>>>> /etc/slurm-llnl
>>>>>>>>
>>>>>>>> Anyway, I suggest upgrading the operating system to stretch and
>>>>>>>> fixing your configuration under a more recent version of Slurm.
>>>>>>>> Best regards
>>>>>>>> --
>>>>>>>> Gennaro Oliva
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> --
>>>> Carles Fenoy
>>>>
>>>
>>>
>>
>>
>> --
>> --
>> Carles Fenoy
>>
>
>

