[slurm-users] Slurm not starting
Elisabetta Falivene
e.falivene at ilabroma.com
Mon Jan 15 10:22:00 MST 2018
The deeper I go into the problem, the worse it seems... but maybe I'm a step
closer to the solution.
I discovered that munge was disabled on the nodes (my fault; Gennaro
pointed out the problem before, but I had re-enabled it only on the master).
By the way, it's very strange that the wheezy->jessie upgrade disabled munge
on all the nodes and on the master...
Unfortunately, re-enabling munge on the nodes didn't make slurmd start
again.
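(For the record, re-enabling it on a node amounts to roughly the following;
this is just a sketch, and it assumes systemd, the same /etc/munge/munge.key
on the master and the nodes, and ssh access to node01:

    systemctl enable munge
    systemctl start munge
    # from the master: does a credential created here decode on a node?
    munge -n | ssh node01 unmunge

If that last step fails, slurmd's munge authentication would fail too.)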
Maybe filling in this setting could give me some info about the problem?
#SlurmdLogFile=
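(If it helps, what I would put there is something like the following; the
paths are just a guess based on the Debian slurm-llnl layout, and the
directory must be writable by the slurm user:

    SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
    SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log

followed by a restart of slurmd/slurmctld so they pick up the change.)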
Thank you very much for your help. It is very precious to me.
betta
PS: some tests I made ->
Running on the nodes
slurmd -Dvvv
returns
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: Considering each NUMA node as a socket
slurmd: debug: CPUs:16 Boards:1 Sockets:4 CoresPerSocket:4 ThreadsPerCore:1
slurmd: Node configuration differs from hardware: CPUs=16:16(hw) Boards=1:1(hw) SocketsPerBoard=16:4(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:1(hw)
slurmd: topology NONE plugin loaded
slurmd: Gathering cpu frequency information for 16 cpus
slurmd: debug: Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: Considering each NUMA node as a socket
slurmd: debug: CPUs:16 Boards:1 Sockets:4 CoresPerSocket:4 ThreadsPerCore:1
slurmd: debug: Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug: task/cgroup: now constraining jobs allocated cores
slurmd: task/cgroup: loaded
slurmd: auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmd: debug: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
slurmd: Munge cryptographic signature plugin loaded
slurmd: Warning: Core limit is only 0 KB
slurmd: slurmd version 14.03.9 started
slurmd: Job accounting gather LINUX plugin loaded
slurmd: debug: job_container none plugin loaded
slurmd: switch NONE plugin loaded
slurmd: slurmd started on Mon, 15 Jan 2018 18:07:17 +0100
slurmd: CPUs=16 Boards=1 Sockets=16 Cores=1 Threads=1 Memory=15999 TmpDisk=40189 Uptime=1254
slurmd: AcctGatherEnergy NONE plugin loaded
slurmd: AcctGatherProfile NONE plugin loaded
slurmd: AcctGatherInfiniband NONE plugin loaded
slurmd: AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
^Cslurmd: got shutdown request
slurmd: waiting on 1 active threads
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
^C^C^C^Cslurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
slurmd: error: Unable to register: Unable to contact slurm controller (connect failure)
slurmd: debug: Unable to register with slurm controller, retrying
slurmd: all threads complete
slurmd: Consumable Resources (CR) Node Selection plugin shutting down ...
slurmd: Munge cryptographic signature plugin unloaded
slurmd: Slurmd shutdown completing
which is maybe not as bad as it seems, since it may only point out that
slurm is not up on the master, isn't it?
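(A quick way to confirm that from a node, without involving slurm at all,
could be something like this; 192.168.1.1 and 6817 are the controller
address and SlurmctldPort from my slurm.conf, and it assumes nc and ss are
installed:

    # from a compute node: is anything listening on the controller port?
    nc -zv 192.168.1.1 6817
    # on the master: is slurmctld running and bound to 6817?
    ss -ltnp | grep 6817

If nothing answers on 6817, the "Connection refused" lines above are just
the symptom of slurmctld not running.)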
On the master running
service slurmctld restart
returns
Job for slurmctld.service failed. See 'systemctl status slurmctld.service'
and 'journalctl -xn' for details.
and
service slurmctld status
returns
slurmctld.service - Slurm controller daemon
   Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
   Active: failed (Result: timeout) since Mon 2018-01-15 18:11:20 CET; 44s ago
  Process: 2223 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 slurmctld[2225]: cons_res: select_p_reconfigure
 slurmctld[2225]: cons_res: select_p_node_init
 slurmctld[2225]: cons_res: preparing for 1 partitions
 slurmctld[2225]: Running as primary controller
 slurmctld[2225]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
 systemd[1]: slurmctld.service start operation timed out. Terminating.
 slurmctld[2225]: Terminate signal (SIGINT or SIGTERM) received
 slurmctld[2225]: Saving all slurm state
 systemd[1]: Failed to start Slurm controller daemon.
 systemd[1]: Unit slurmctld.service entered failed state.
and
journalctl -xn
returns no visible error
-- Logs begin at Mon 2018-01-15 18:04:38 CET, end at Mon 2018-01-15 18:17:33 CET. --
Jan 15 18:16:23 anyone.phys.uniroma1.it slurmctld[2286]: Saving all slurm state
Jan 15 18:16:23 anyone.phys.uniroma1.it systemd[1]: Failed to start Slurm controller daemon.
-- Subject: Unit slurmctld.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit slurmctld.service has failed.
--
-- The result is failed.
systemd[1]: Unit slurmctld.service entered failed state.
CRON[2312]: pam_unix(cron:session): session opened for user root by (uid=0)
CRON[2313]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
CRON[2312]: pam_unix(cron:session): session closed for user root
dhcpd[1538]: DHCPREQUEST for 192.168.1.101 from c8:60:00:32:c6:c4 via eth1
dhcpd[1538]: DHCPACK on 192.168.1.101 to c8:60:00:32:c6:c4 via eth1
dhcpd[1538]: DHCPREQUEST for 192.168.1.102 from bc:ae:c5:12:97:75 via eth1
dhcpd[1538]: DHCPACK on 192.168.1.102 to bc:ae:c5:12:97:75 via eth1
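(Since journalctl shows nothing useful, maybe running the controller in the
foreground with extra verbosity would say why it never finishes starting; a
sketch, assuming the service is stopped first:

    service slurmctld stop
    slurmctld -Dvvvv

-D keeps slurmctld in the foreground and each -v adds verbosity, so errors
that never reach syslog should show up on the terminal.)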
2018-01-15 16:56 GMT+01:00 Carlos Fenoy <minibit at gmail.com>:
> Hi,
>
> you cannot start slurmd on the headnode. Try running the same command
> on the compute nodes and check the output. If there is any issue, it should
> display the reason.
>
> Regards,
> Carlos
>
> On Mon, Jan 15, 2018 at 4:50 PM, Elisabetta Falivene <
> e.falivene at ilabroma.com> wrote:
>
>> On the headnode. (I'm also noticing, and it seems worth mentioning since
>> maybe the problem is the same, that even LDAP is not working as expected,
>> giving the message "invalid credentials (49)", which is a message given when
>> there are problems of this type. The update to jessie must have touched
>> something that is affecting all my software's sanity :D )
>>
>> Here is my slurm.conf.
>>
>> # slurm.conf file generated by configurator.html.
>> # Put this file on all nodes of your cluster.
>> # See the slurm.conf man page for more information.
>> #
>> ControlMachine=anyone
>> ControlAddr=master
>> #BackupController=
>> #BackupAddr=
>> #
>> AuthType=auth/munge
>> CacheGroups=0
>> #CheckpointType=checkpoint/none
>> CryptoType=crypto/munge
>> #DisableRootJobs=NO
>> #EnforcePartLimits=NO
>> #Epilog=
>> #EpilogSlurmctld=
>> #FirstJobId=1
>> #MaxJobId=999999
>> #GresTypes=
>> #GroupUpdateForce=0
>> #GroupUpdateTime=600
>> #JobCheckpointDir=/var/slurm/checkpoint
>> #JobCredentialPrivateKey=
>> #JobCredentialPublicCertificate=
>> #JobFileAppend=0
>> #JobRequeue=1
>> #JobSubmitPlugins=1
>> #KillOnBadExit=0
>> #Licenses=foo*4,bar
>> #MailProg=/bin/mail
>> #MaxJobCount=5000
>> #MaxStepCount=40000
>> #MaxTasksPerNode=128
>> MpiDefault=openmpi
>> MpiParams=ports=12000-12999
>> #PluginDir=
>> #PlugStackConfig=
>> #PrivateData=jobs
>> ProctrackType=proctrack/cgroup
>> #Prolog=
>> #PrologSlurmctld=
>> #PropagatePrioProcess=0
>> #PropagateResourceLimits=
>> #PropagateResourceLimitsExcept=
>> ReturnToService=2
>> #SallocDefaultCommand=
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmctldPort=6817
>> SlurmdPidFile=/var/run/slurmd.pid
>> SlurmdPort=6818
>> SlurmdSpoolDir=/tmp/slurmd
>> SlurmUser=slurm
>> #SlurmdUser=root
>> #SrunEpilog=
>> #SrunProlog=
>> StateSaveLocation=/tmp
>> SwitchType=switch/none
>> #TaskEpilog=
>> TaskPlugin=task/cgroup
>> #TaskPluginParam=
>> #TaskProlog=
>> #TopologyPlugin=topology/tree
>> #TmpFs=/tmp
>> #TrackWCKey=no
>> #TreeWidth=
>> #UnkillableStepProgram=
>> #UsePAM=0
>> #
>> #
>> # TIMERS
>> #BatchStartTimeout=10
>> #CompleteWait=0
>> #EpilogMsgTime=2000
>> #GetEnvTimeout=2
>> #HealthCheckInterval=0
>> #HealthCheckProgram=
>> InactiveLimit=0
>> KillWait=60
>> #MessageTimeout=10
>> #ResvOverRun=0
>> MinJobAge=43200
>> #OverTimeLimit=0
>> SlurmctldTimeout=600
>> SlurmdTimeout=600
>> #UnkillableStepTimeout=60
>> #VSizeFactor=0
>> Waittime=0
>> #
>> #
>> # SCHEDULING
>> DefMemPerCPU=1000
>> FastSchedule=1
>> #MaxMemPerCPU=0
>> #SchedulerRootFilter=1
>> #SchedulerTimeSlice=30
>> SchedulerType=sched/backfill
>> #SchedulerPort=
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU_Memory
>> #
>> #
>> # JOB PRIORITY
>> #PriorityType=priority/basic
>> #PriorityDecayHalfLife=
>> #PriorityCalcPeriod=
>> #PriorityFavorSmall=
>> #PriorityMaxAge=
>> #PriorityUsageResetPeriod=
>> #PriorityWeightAge=
>> #PriorityWeightFairshare=
>> #PriorityWeightJobSize=
>> #PriorityWeightPartition=
>> #PriorityWeightQOS=
>> #
>> #
>> # LOGGING AND ACCOUNTING
>> #AccountingStorageEnforce=0
>> #AccountingStorageHost=
>> AccountingStorageLoc=/var/log/slurm-llnl/AccountingStorage.log
>> #AccountingStoragePass=
>> #AccountingStoragePort=
>> AccountingStorageType=accounting_storage/filetxt
>> #AccountingStorageUser=
>> AccountingStoreJobComment=YES
>> ClusterName=cluster
>> #DebugFlags=
>> #JobCompHost=
>> JobCompLoc=/var/log/slurm-llnl/JobComp.log
>> #JobCompPass=
>> #JobCompPort=
>> JobCompType=jobcomp/filetxt
>> #JobCompUser=
>> JobAcctGatherFrequency=60
>> JobAcctGatherType=jobacct_gather/linux
>> SlurmctldDebug=3
>> #SlurmctldLogFile=
>> SlurmdDebug=3
>> #SlurmdLogFile=
>> #SlurmSchedLogFile=
>> #SlurmSchedLogLevel=
>> #
>> #
>> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
>> #SuspendProgram=
>> #ResumeProgram=
>> #SuspendTimeout=
>> #ResumeTimeout=
>> #ResumeRate=
>> #SuspendExcNodes=
>> #SuspendExcParts=
>> #SuspendRate=
>> #SuspendTime=
>> #
>> #
>> # COMPUTE NODES
>> NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
>> PartitionName=batch Nodes=node[01-08] Default=YES MaxTime=INFINITE
>> State=UP
>>
>>
>> 2018-01-15 16:43 GMT+01:00 Carlos Fenoy <minibit at gmail.com>:
>>
>>> Are you trying to start the slurmd in the headnode or a compute node?
>>>
>>> Can you provide the slurm.conf file?
>>>
>>> Regards,
>>> Carlos
>>>
>>> On Mon, Jan 15, 2018 at 4:30 PM, Elisabetta Falivene <
>>> e.falivene at ilabroma.com> wrote:
>>>
>>>> slurmd -Dvvv says
>>>>
>>>> slurmd: fatal: Unable to determine this slurmd's NodeName
>>>>
>>>> b
>>>>
>>>> 2018-01-15 15:58 GMT+01:00 Douglas Jacobsen <dmjacobsen at lbl.gov>:
>>>>
>>>>> The fact that sinfo is responding shows that at least slurmctld is
>>>>> running. Slurmd, on the other hand, is not. Please also get the output of
>>>>> the slurmd log or of running "slurmd -Dvvv".
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> On Jan 15, 2018 06:42, "Elisabetta Falivene" <e.falivene at ilabroma.com>
>>>>> wrote:
>>>>>
>>>>>> > Anyway, I suggest updating the operating system to stretch and fixing
>>>>>> > your configuration under a more recent version of Slurm.
>>>>>>
>>>>>> I think I'll soon arrive to that :)
>>>>>> b
>>>>>>
>>>>>> 2018-01-15 14:08 GMT+01:00 Gennaro Oliva <oliva.g at na.icar.cnr.it>:
>>>>>>
>>>>>>> Ciao Elisabetta,
>>>>>>>
>>>>>>> On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wrote:
>>>>>>> > Error messages are not helping me much in guessing what is going on.
>>>>>>> > What should I check to find out what is failing?
>>>>>>>
>>>>>>> Check slurmctld.log and slurmd.log; you can find them under
>>>>>>> /var/log/slurm-llnl
>>>>>>>
>>>>>>> > PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>>>>>> > batch        up   infinite      8   unk* node[01-08]
>>>>>>> >
>>>>>>> >
>>>>>>> > Running
>>>>>>> > systemctl status slurmctld.service
>>>>>>> >
>>>>>>> > returns
>>>>>>> >
>>>>>>> > slurmctld.service - Slurm controller daemon
>>>>>>> >    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
>>>>>>> >    Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s ago
>>>>>>> >   Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>>>>>>> >
>>>>>>> >  slurmctld[2100]: cons_res: select_p_reconfigure
>>>>>>> >  slurmctld[2100]: cons_res: select_p_node_init
>>>>>>> >  slurmctld[2100]: cons_res: preparing for 1 partitions
>>>>>>> >  slurmctld[2100]: Running as primary controller
>>>>>>> >  slurmctld[2100]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>>>>>>> >  slurmctld.service start operation timed out. Terminating.
>>>>>>> > Terminate signal (SIGINT or SIGTERM) received
>>>>>>> >  slurmctld[2100]: Saving all slurm state
>>>>>>> >  Failed to start Slurm controller daemon.
>>>>>>> >  Unit slurmctld.service entered failed state.
>>>>>>>
>>>>>>> Do you have a backup controller?
>>>>>>> Check your slurm.conf under:
>>>>>>> /etc/slurm-llnl
>>>>>>>
>>>>>>> Anyway, I suggest updating the operating system to stretch and fixing
>>>>>>> your configuration under a more recent version of Slurm.
>>>>>>> Best regards
>>>>>>> --
>>>>>>> Gennaro Oliva
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>>
>>> --
>>> --
>>> Carles Fenoy
>>>
>>
>>
>
>
> --
> --
> Carles Fenoy
>