<div dir="ltr"><div>Here is the solution and another (minor) problem!</div><div><br></div><div>Investigating in the direction of the pid problem I found that in the setting there was a <br></div><div><div><b>SlurmctldPidFile=/var/run/slurmctld.pid</b></div><div><b>SlurmdPidFile=/var/run/slurmd.pid</b></div></div><div><br></div><div>but the pid was searched in /var/run/slurm-llnl so I changed in the slurm.conf of the master AND the nodes</div><div><div><b>SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid</b></div><div><b>SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid</b></div></div><div><b><br></b></div><div>being again able to launch slurmctld on the master and slurmd on the nodes.</div><div><br></div><div>At this point the nodes were all set to drain automatically giving an error like</div><div><br></div><div><b>error: Node node01 has low real_memory size (15999 < 16000)</b><br></div><div><b><br></b></div><div>so it was necessary to change in the slurm.conf (master and nodes) </div><div><b>NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN</b><br></div><div>to</div><div><b>NodeName=node[01-08] CPUs=16 RealMemory=15999 State=UNKNOWN</b><br></div><div><b><br></b></div><div>Now, slurm works and the nodes are running. There is only one minor problem</div><div><br></div><div><div><b>error: Node node04 has low real_memory size (7984 < 15999)</b></div><div><b>error: Node node02 has low real_memory size (3944 < 15999)</b><br></div></div><div><b><br></b></div><div>Two nodes are still put to drain state. The nodes suffered a physical damage to some rams and I had to physically remove them, so slurm think it is not a good idea to use them. </div><div>It is possibile to make slurm use the node anyway? I know I can scontrol update NodeName=node04 State=RESUME and put back the node to idle state, but as the machine is rebooted or the service restarted it would be set to drain again.</div><div><br></div><div>Thank you for your help!</div><div>b</div></div><div class="gmail_extra"><br><div class="gmail_quote">2018-01-16 13:25 GMT+01:00 Elisabetta Falivene <span dir="ltr"><<a href="mailto:e.falivene@ilabroma.com" target="_blank">e.falivene@ilabroma.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class=""><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><p dir="ltr">It seems like the pidfile in systemd and slurm.conf are different. Check if they are the same and if not adjust the slurm.conf pid files. That should prevent systemd from killing slurm.</p></blockquote></span><div>Emh, sorry, how I can do this? </div><div><div class="h5"><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="m_6556872972326110705HOEnZb"><div class="m_6556872972326110705h5"><div class="gmail_quote"><div dir="ltr">On Mon, 15 Jan 2018, 18:24 Elisabetta Falivene, <<a href="mailto:e.falivene@ilabroma.com" target="_blank">e.falivene@ilabroma.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">The deeper I go in the problem, the worser it seems... but maybe I'm a step closer to the solution.<div><br><div>I discovered that munge was disabled on the nodes (my fault, Gennaro pointed out the problem before, but I enabled it back only on the master). 
2018-01-16 13:25 GMT+01:00 Elisabetta Falivene <e.falivene@ilabroma.com>:

> It seems like the pidfile in systemd and slurm.conf are different. Check if they are the same and if not adjust the slurm.conf pid files. That should prevent systemd from killing slurm.

Emh, sorry, how can I do this?
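For the record, one way to compare the two (a sketch; the unit file paths are an assumption and may differ between installations):

  # PIDFile according to systemd
  grep PIDFile /lib/systemd/system/slurmctld.service /lib/systemd/system/slurmd.service
  # PidFile according to Slurm
  grep -i pidfile /etc/slurm-llnl/slurm.conf

If the paths differ, systemd never sees the pid file that slurmctld/slurmd write, assumes the start failed, and kills the daemon when the start timeout expires.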
On Mon, 15 Jan 2018, 18:24 Elisabetta Falivene <e.falivene@ilabroma.com> wrote:

The deeper I go into the problem, the worse it seems... but maybe I'm a step closer to the solution.

I discovered that munge was disabled on the nodes (my fault: Gennaro pointed out the problem before, but I re-enabled it only on the master). Btw, it's very strange that the wheezy->jessie upgrade disabled munge on all the nodes and the master...

Unfortunately, re-enabling munge on the nodes didn't make slurmd start again.

Maybe filling in this setting could give me some info about the problem?

#SlurmdLogFile=
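It may well help. A minimal sketch (the exact paths are an assumption, matching the /var/log/slurm-llnl directory the other log options in this config already use):

  SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
  SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log

With these set in slurm.conf, each daemon writes to its own file rather than only to syslog, which makes failed starts much easier to trace.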
Thank you very much for your help. It is very precious to me.
betta

Ps: some tests I made ->

Running on the nodes

slurmd -Dvvv

returns

slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: Considering each NUMA node as a socket
slurmd: debug: CPUs:16 Boards:1 Sockets:4 CoresPerSocket:4 ThreadsPerCore:1
slurmd: Node configuration differs from hardware: CPUs=16:16(hw) Boards=1:1(hw) SocketsPerBoard=16:4(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:1(hw)
slurmd: topology NONE plugin loaded
slurmd: Gathering cpu frequency information for 16 cpus
slurmd: debug: Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: Considering each NUMA node as a socket
slurmd: debug: CPUs:16 Boards:1 Sockets:4 CoresPerSocket:4 ThreadsPerCore:1
slurmd: debug: Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug: task/cgroup: now constraining jobs allocated cores
slurmd: task/cgroup: loaded
slurmd: auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmd: debug: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
slurmd: Munge cryptographic signature plugin loaded
slurmd: Warning: Core limit is only 0 KB
slurmd: slurmd version 14.03.9 started
slurmd: Job accounting gather LINUX plugin loaded
slurmd: debug: job_container none plugin loaded
slurmd: switch NONE plugin loaded
slurmd: slurmd started on Mon, 15 Jan 2018 18:07:17 +0100
slurmd: CPUs=16 Boards=1 Sockets=16 Cores=1 Threads=1 Memory=15999 TmpDisk=40189 Uptime=1254
slurmd: AcctGatherEnergy NONE plugin loaded
slurmd: AcctGatherProfile NONE plugin loaded
slurmd: AcctGatherInfiniband NONE plugin loaded
slurmd: AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug: Failed to contact primary controller: Connection refused
[the three lines above repeat several times]
^Cslurmd: got shutdown request
slurmd: waiting on 1 active threads
[more "Connection refused" retries]
slurmd: error: Unable to register: Unable to contact slurm controller (connect failure)
slurmd: debug: Unable to register with slurm controller, retrying
slurmd: all threads complete
slurmd: Consumable Resources (CR) Node Selection plugin shutting down ...
slurmd: Munge cryptographic signature plugin unloaded
slurmd: Slurmd shutdown completing

which maybe is not as bad as it seems, since it may only point out that Slurm is not up on the master, right?
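That reading fits the log: slurmd comes up, loads all its plugins, and only fails when registering with slurmctld at 192.168.1.1:6817. A quick check from a node (a sketch; scontrol is part of the standard Slurm client tools, nc is plain netcat):

  scontrol ping                 # asks the controller whether it is up
  nc -zv 192.168.1.1 6817       # is anything listening on the slurmctld port?

"Connection refused" from both would confirm that nothing is listening, i.e. the problem is slurmctld on the master, not the nodes.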
On the master, running

service slurmctld restart

returns

Job for slurmctld.service failed. See 'systemctl status slurmctld.service' and 'journalctl -xn' for details.

and

service slurmctld status

returns

slurmctld.service - Slurm controller daemon
   Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
   Active: failed (Result: timeout) since Mon 2018-01-15 18:11:20 CET; 44s ago
  Process: 2223 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)

 slurmctld[2225]: cons_res: select_p_reconfigure
 slurmctld[2225]: cons_res: select_p_node_init
 slurmctld[2225]: cons_res: preparing for 1 partitions
 slurmctld[2225]: Running as primary controller
 slurmctld[2225]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
 systemd[1]: slurmctld.service start operation timed out. Terminating.
 slurmctld[2225]: Terminate signal (SIGINT or SIGTERM) received
 slurmctld[2225]: Saving all slurm state
 systemd[1]: Failed to start Slurm controller daemon.
 systemd[1]: Unit slurmctld.service entered failed state.

and

journalctl -xn

returns no visible error:

-- Logs begin at Mon 2018-01-15 18:04:38 CET, end at Mon 2018-01-15 18:17:33 CET. --
Jan 15 18:16:23 anyone.phys.uniroma1.it slurmctld[2286]: Saving all slurm state
Jan 15 18:16:23 anyone.phys.uniroma1.it systemd[1]: Failed to start Slurm controller daemon.
-- Subject: Unit slurmctld.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit slurmctld.service has failed.
--
-- The result is failed.
 systemd[1]: Unit slurmctld.service entered failed state.
 CRON[2312]: pam_unix(cron:session): session opened for user root by (uid=0)
 CRON[2313]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
 CRON[2312]: pam_unix(cron:session): session closed for user root
 dhcpd[1538]: DHCPREQUEST for 192.168.1.101 from c8:60:00:32:c6:c4 via eth1
 dhcpd[1538]: DHCPACK on 192.168.1.101 to c8:60:00:32:c6:c4 via eth1
 dhcpd[1538]: DHCPREQUEST for 192.168.1.102 from bc:ae:c5:12:97:75 via eth1
 dhcpd[1538]: DHCPACK on 192.168.1.102 to bc:ae:c5:12:97:75 via eth1
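Note the failure mode: slurmctld itself starts fine (status=0/SUCCESS, "Running as primary controller") and it is systemd that kills it at the start timeout — exactly what happens when the PIDFile named in the unit never appears. This is the pid-path mismatch solved at the top of this thread. One related caveat: /var/run is a symlink to the tmpfs /run on jessie, so the slurm-llnl pid directory has to be recreated at every boot. A minimal sketch, assuming systemd-tmpfiles is in use (the file name is arbitrary; ownership matches SlurmUser):

  # /etc/tmpfiles.d/slurm.conf
  d /run/slurm-llnl 0755 slurm slurm -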
2018-01-15 16:56 GMT+01:00 Carlos Fenoy <minibit@gmail.com>:

Hi,

you cannot start slurmd on the headnode. Try running the same command on the compute nodes and check the output. If there is any issue it should display the reason.

Regards,
Carlos
--
Carles Fenoy

On Mon, Jan 15, 2018 at 4:50 PM, Elisabetta Falivene <e.falivene@ilabroma.com> wrote:

On the headnode. (I'm also noticing, and it seems worth mentioning since maybe the problem is the same, that even ldap is not working as expected, giving an "invalid credential (49)" message, which is typical of this kind of problem. The update to jessie must have touched something that is affecting all my software's sanity :D )

Here is my slurm.conf:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=anyone
ControlAddr=master
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=openmpi
MpiParams=ports=12000-12999
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=2
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/tmp
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/cgroup
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=60
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=43200
#OverTimeLimit=0
SlurmctldTimeout=600
SlurmdTimeout=600
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
DefMemPerCPU=1000
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
#SchedulerPort=
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
AccountingStorageLoc=/var/log/slurm-llnl/AccountingStorage.log
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/filetxt
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
JobCompLoc=/var/log/slurm-llnl/JobComp.log
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/filetxt
#JobCompUser=
JobAcctGatherFrequency=60
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
PartitionName=batch Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP
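The "Node configuration differs from hardware" warning in the slurmd log earlier suggests the NodeName line doesn't quite match the real machines. A handy cross-check, run on a compute node (slurmd -C is a standard option, though the exact output format varies by Slurm version):

  slurmd -C

It prints the node's actual hardware as a ready-made slurm.conf line, something like (values here are illustrative only):

  NodeName=node01 CPUs=16 Boards=1 SocketsPerBoard=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=15999

which can be pasted into the config instead of guessing CPUs and RealMemory.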
2018-01-15 16:43 GMT+01:00 Carlos Fenoy <minibit@gmail.com>:

Are you trying to start the slurmd on the headnode or a compute node?

Can you provide the slurm.conf file?

Regards,
Carlos
--
Carles Fenoy

On Mon, Jan 15, 2018 at 4:30 PM, Elisabetta Falivene <e.falivene@ilabroma.com> wrote:

slurmd -Dvvv says

slurmd: fatal: Unable to determine this slurmd's NodeName

b
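That fatal error usually means the machine's hostname is not in the NodeName list — which is what happens when slurmd is started on the master ("anyone") with a config that only defines node[01-08]. A quick sketch of what to compare (the -N override is, to the best of my knowledge, supported by slurmd, but treat it as a testing aid only):

  hostname -s                 # must match one of the NodeName entries
  slurmd -Dvvv -N node01      # force a node name, just to test the rest of the config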
class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto">The fact that sinfo is responding shows that at least slurmctld is running. Slumd, on the other hand is not. Please also get output of slurmd log or running "slurmd -Dvvv"</div></blockquote><div><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="m_6556872972326110705m_486041846320190883m_6235482560546836903m_-7886990152372667557m_1226285621552704478m_-9093742716976473048m_5348135526455917116gmail-HOEnZb"><div class="m_6556872972326110705m_486041846320190883m_6235482560546836903m_-7886990152372667557m_1226285621552704478m_-9093742716976473048m_5348135526455917116gmail-h5"><div class="gmail_extra"><br><div class="gmail_quote">On Jan 15, 2018 06:42, "Elisabetta Falivene" <<a href="mailto:e.falivene@ilabroma.com" target="_blank">e.falivene@ilabroma.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><span style="font-size:12.8px">> Anyway I suggest to update the operating system to stretch and fix your</span><br style="font-size:12.8px"><span style="font-size:12.8px">> configuration under a more recent version of slurm.</span><br><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">I think I'll soon arrive to that :)</span></div><div><span style="font-size:12.8px">b</span></div></div><div class="gmail_extra"><br><div class="gmail_quote">2018-01-15 14:08 GMT+01:00 Gennaro Oliva <span dir="ltr"><<a href="mailto:oliva.g@na.icar.cnr.it" target="_blank">oliva.g@na.icar.cnr.it</a>></span>:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Ciao Elisabetta,<br>
On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wrote:
> Error messages are not much helping me in guessing what is going on. What
> should I check to get what is failing?

check slurmctld.log and slurmd.log, you can find them under
/var/log/slurm-llnl

> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> batch        up   infinite      8   unk* node[01-08]
>
> Running
> systemctl status slurmctld.service
>
> returns
>
> slurmctld.service - Slurm controller daemon
>    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
>    Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s
> ago
>   Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
> (code=exited, status=0/SUCCESS)
>
>  slurmctld[2100]: cons_res: select_p_reconfigure
>  slurmctld[2100]: cons_res: select_p_node_init
>  slurmctld[2100]: cons_res: preparing for 1 partitions
>  slurmctld[2100]: Running as primary controller
>  slurmctld[2100]:
> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>  slurmctld.service start operation timed out. Terminating.
> Terminate signal (SIGINT or SIGTERM) received
>  slurmctld[2100]: Saving all slurm state
>  Failed to start Slurm controller daemon.
>  Unit slurmctld.service entered failed state.

Do you have a backup controller?
Check your slurm.conf under:
/etc/slurm-llnl

Anyway I suggest to update the operating system to stretch and fix your
configuration under a more recent version of slurm.
Best regards
--
Gennaro Oliva
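A sketch of that check (paths as given above; the files exist only if the *LogFile options are set in slurm.conf or the packaging configured them):

  tail -n 50 /var/log/slurm-llnl/slurmctld.log   # on the master
  tail -n 50 /var/log/slurm-llnl/slurmd.log      # on a compute node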