<div dir="ltr">Hi,<div><br></div><div>You cannot start slurmd on the headnode. Try running the same command on the compute nodes and check the output; if there is any issue, it should display the reason.</div><div><br></div><div>Regards,</div><div>Carlos</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Jan 15, 2018 at 4:50 PM, Elisabetta Falivene <span dir="ltr"><<a href="mailto:e.falivene@ilabroma.com" target="_blank">e.falivene@ilabroma.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">On the headnode. (I'm also noticing, and it seems worth mentioning since the problem may be related, that even LDAP is not working as expected: it gives an "invalid credential (49)" message, which is typical of this kind of problem. The update to jessie must have touched something that is affecting all my software's sanity :D )<div><br><div>Here is my slurm.conf.<div><div><br></div><div><div># slurm.conf file generated by configurator.html.</div><div># Put this file on all nodes of your cluster.</div><div># See the slurm.conf man page for more 
information.</div><div>#</div><div>ControlMachine=anyone</div><div>ControlAddr=master</div><div>#BackupController=</div><div>#BackupAddr=</div><div>#</div><div>AuthType=auth/munge</div><div>CacheGroups=0</div><div>#CheckpointType=checkpoint/none</div><div>CryptoType=crypto/munge</div><div>#DisableRootJobs=NO</div><div>#EnforcePartLimits=NO</div><div>#Epilog=</div><div>#EpilogSlurmctld=</div><div>#FirstJobId=1</div><div>#MaxJobId=999999</div><div>#GresTypes=</div><div>#GroupUpdateForce=0</div><div>#GroupUpdateTime=600</div><div>#JobCheckpointDir=/var/slurm/checkpoint</div><div>#JobCredentialPrivateKey=</div><div>#JobCredentialPublicCertificate=</div><div>#JobFileAppend=0</div><div>#JobRequeue=1</div><div>#JobSubmitPlugins=1</div><div>#KillOnBadExit=0</div><div>#Licenses=foo*4,bar</div><div>#MailProg=/bin/mail</div><div>#MaxJobCount=5000</div><div>#MaxStepCount=40000</div><div>#MaxTasksPerNode=128</div><div>MpiDefault=openmpi</div><div>MpiParams=ports=12000-12999</div><div>#PluginDir=</div><div>#PlugStackConfig=</div><div>#PrivateData=jobs</div><div>ProctrackType=proctrack/cgroup</div><div>#Prolog=</div><div>#PrologSlurmctld=</div><div>#PropagatePrioProcess=0</div><div>#PropagateResourceLimits=</div><div>#PropagateResourceLimitsExcept=</div><div>ReturnToService=2</div><div>#SallocDefaultCommand=</div><div>SlurmctldPidFile=/var/run/slurmctld.pid</div><div>SlurmctldPort=6817</div><div>SlurmdPidFile=/var/run/slurmd.pid</div><div>SlurmdPort=6818</div><div>SlurmdSpoolDir=/tmp/slurmd</div><div>SlurmUser=slurm</div><div>#SlurmdUser=root</div><div>#SrunEpilog=</div><div>#SrunProlog=</div><div>StateSaveLocation=/tmp</div><div>SwitchType=switch/none</div><div>#TaskEpilog=</div><div>TaskPlugin=task/cgroup</div><div>#TaskPluginParam=</div><div>#TaskProlog=</div><div>#TopologyPlugin=topology/tree</div><div>#TmpFs=/tmp</div><div>#TrackWCKey=no</div><div>#TreeWidth=</div><div>#UnkillableStepProgram=</div><div>#UsePAM=0</div><div>#</div><div>#</div>
<div># TIMERS</div><div>#BatchStartTimeout=10</div><div>#CompleteWait=0</div><div>#EpilogMsgTime=2000</div><div>#GetEnvTimeout=2</div><div>#HealthCheckInterval=0</div><div>#HealthCheckProgram=</div><div>InactiveLimit=0</div><div>KillWait=60</div><div>#MessageTimeout=10</div><div>#ResvOverRun=0</div><div>MinJobAge=43200</div><div>#OverTimeLimit=0</div><div>SlurmctldTimeout=600</div><div>SlurmdTimeout=600</div><div>#UnkillableStepTimeout=60</div><div>#VSizeFactor=0</div><div>Waittime=0</div><div>#</div><div>#</div><div># SCHEDULING</div><div>DefMemPerCPU=1000</div><div>FastSchedule=1</div><div>#MaxMemPerCPU=0</div><div>#SchedulerRootFilter=1</div><div>#SchedulerTimeSlice=30</div><div>SchedulerType=sched/backfill</div><div>#SchedulerPort=</div><div>SelectType=select/cons_res</div><div>SelectTypeParameters=CR_CPU_Memory</div><div>#</div><div>#</div><div># JOB PRIORITY</div><div>#PriorityType=priority/basic</div><div>#PriorityDecayHalfLife=</div><div>#PriorityCalcPeriod=</div><div>#PriorityFavorSmall=</div><div>#PriorityMaxAge=</div><div>#PriorityUsageResetPeriod=</div><div>#PriorityWeightAge=</div><div>#PriorityWeightFairshare=</div><div>#PriorityWeightJobSize=</div><div>#PriorityWeightPartition=</div><div>#PriorityWeightQOS=</div><div>#</div><div>#</div><div># LOGGING AND 
ACCOUNTING</div><div>#AccountingStorageEnforce=0</div><div>#AccountingStorageHost=</div><div>AccountingStorageLoc=/var/log/slurm-llnl/AccountingStorage.log</div><div>#AccountingStoragePass=</div><div>#AccountingStoragePort=</div><div>AccountingStorageType=accounting_storage/filetxt</div><div>#AccountingStorageUser=</div><div>AccountingStoreJobComment=YES</div><div>ClusterName=cluster</div><div>#DebugFlags=</div><div>#JobCompHost=</div><div>JobCompLoc=/var/log/slurm-llnl/JobComp.log</div><div>#JobCompPass=</div><div>#JobCompPort=</div><div>JobCompType=jobcomp/filetxt</div><div>#JobCompUser=</div><div>JobAcctGatherFrequency=60</div><div>JobAcctGatherType=jobacct_gather/linux</div><div>SlurmctldDebug=3</div><div>#SlurmctldLogFile=</div><div>SlurmdDebug=3</div><div>#SlurmdLogFile=</div><div>#SlurmSchedLogFile=</div><div>#SlurmSchedLogLevel=</div><div>#</div><div>#</div><div># POWER SAVE SUPPORT FOR IDLE NODES (optional)</div><div>#SuspendProgram=</div><div>#ResumeProgram=</div><div>#SuspendTimeout=</div><div>#ResumeTimeout=</div><div>#ResumeRate=</div><div>#SuspendExcNodes=</div><div>#SuspendExcParts=</div><div>#SuspendRate=</div><div>#SuspendTime=</div><div>#</div><div>#</div><div># COMPUTE NODES</div><div>NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN</div><div>PartitionName=batch Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP</div></div><div><br></div></div></div></div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">2018-01-15 16:43 GMT+01:00 Carlos Fenoy <span dir="ltr"><<a href="mailto:minibit@gmail.com" target="_blank">minibit@gmail.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Are you trying to start slurmd on the headnode or on a compute node?<div><br></div><div>Can you provide the slurm.conf file?</div><div><br></div><div>Regards,</div><div>Carlos</div></div><div 
class="gmail_extra"><div><div class="m_1226285621552704478h5"><br><div class="gmail_quote">On Mon, Jan 15, 2018 at 4:30 PM, Elisabetta Falivene <span dir="ltr"><<a href="mailto:e.falivene@ilabroma.com" target="_blank">e.falivene@ilabroma.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>slurmd -Dvvv says</div><div><br></div><div>slurmd: fatal: Unable to determine this slurmd's NodeName</div><div><br></div><div>b</div><div><div class="m_1226285621552704478m_-9093742716976473048h5"><div class="gmail_extra"><br><div class="gmail_quote">2018-01-15 15:58 GMT+01:00 Douglas Jacobsen <span dir="ltr"><<a href="mailto:dmjacobsen@lbl.gov" target="_blank">dmjacobsen@lbl.gov</a>></span>:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto">The fact that sinfo is responding shows that at least slurmctld is running. Slurmd, on the other hand, is not. 
Please also get the output of the slurmd log, or run "slurmd -Dvvv".</div></blockquote><div><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="m_1226285621552704478m_-9093742716976473048m_5348135526455917116gmail-HOEnZb"><div class="m_1226285621552704478m_-9093742716976473048m_5348135526455917116gmail-h5"><div class="gmail_extra"><br><div class="gmail_quote">On Jan 15, 2018 06:42, "Elisabetta Falivene" <<a href="mailto:e.falivene@ilabroma.com" target="_blank">e.falivene@ilabroma.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><span style="font-size:12.8px">> Anyway, I suggest updating the operating system to stretch and fixing your</span><br style="font-size:12.8px"><span style="font-size:12.8px">> configuration under a more recent version of Slurm.</span><br><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">I think I'll get to that soon :)</span></div><div><span style="font-size:12.8px">b</span></div></div><div class="gmail_extra"><br><div class="gmail_quote">2018-01-15 14:08 GMT+01:00 Gennaro Oliva <span dir="ltr"><<a href="mailto:oliva.g@na.icar.cnr.it" target="_blank">oliva.g@na.icar.cnr.it</a>></span>:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Ciao Elisabetta,<br>
<span><br>
On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wrote:<br>
> The error messages are not helping me much in figuring out what is going on. What<br>
> should I check to find out what is failing?<br>
<br>
</span>check slurmctld.log and slurmd.log, you can find them under<br>
/var/log/slurm-llnl<br>
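For instance, a quick way to look at the tails of both logs (the paths below are the Debian slurm-llnl defaults, an assumption; adjust them if slurm.conf sets SlurmctldLogFile or SlurmdLogFile):<br>

```shell
# Sketch: print the last lines of the Slurm daemon logs, if present.
# Paths are the Debian slurm-llnl defaults (an assumption); override
# them if slurm.conf sets SlurmctldLogFile / SlurmdLogFile.
check_slurm_logs() {
  for f in /var/log/slurm-llnl/slurmctld.log /var/log/slurm-llnl/slurmd.log; do
    if [ -r "$f" ]; then
      printf '== %s ==\n' "$f"
      tail -n 20 "$f"
    else
      printf '%s is missing or unreadable\n' "$f"
    fi
  done
}
check_slurm_logs
```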
<br>
> *PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST*<br>
> *batch*       up   infinite      8   unk* node[01-08]*<br>
><br>
><br>
> Running<br>
> *systemctl status slurmctld.service*<br>
><br>
> returns<br>
><br>
> *slurmctld.service - Slurm controller daemon*<br>
> *   Loaded: loaded (/lib/systemd/system/slurmctld<wbr>.service; enabled)*<br>
> *   Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s<br>
> ago*<br>
> *  Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS<br>
> (code=exited, status=0/SUCCESS)*<br>
><br>
> * slurmctld[2100]: cons_res: select_p_reconfigure*<br>
> * slurmctld[2100]: cons_res: select_p_node_init*<br>
> * slurmctld[2100]: cons_res: preparing for 1 partitions*<br>
> * slurmctld[2100]: Running as primary controller*<br>
> * slurmctld[2100]:<br>
> SchedulerParameters=default_qu<wbr>eue_depth=100,max_rpc_cnt=0,ma<wbr>x_sched_time=4,partition_job_d<wbr>epth=0*<br>
> * slurmctld.service start operation timed out. Terminating.*<br>
> *Terminate signal (SIGINT or SIGTERM) received*<br>
> * slurmctld[2100]: Saving all slurm state*<br>
> * Failed to start Slurm controller daemon.*<br>
> * Unit slurmctld.service entered failed state.*<br>
<br>
Do you have a backup controller?<br>
Check your slurm.conf under:<br>
/etc/slurm-llnl<br>
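As a side note, slurmd aborts at startup with "Unable to determine this slurmd's NodeName" when the machine's short hostname matches none of the configured node names. A minimal sketch of roughly that lookup (the node list is written out by hand here for illustration; in a real slurm.conf, Slurm itself expands ranges like node[01-08]):<br>

```shell
# Sketch of the lookup slurmd performs at startup: the local short
# hostname must match one of the configured NodeName entries.
# The node names below are illustrative, mirroring the node[01-08]
# range from this thread; Slurm expands such ranges internally.
node_listed() {
  h=$1; shift
  for n in "$@"; do
    [ "$h" = "$n" ] && return 0
  done
  return 1
}

nodes="node01 node02 node03 node04 node05 node06 node07 node08"
if node_listed "$(hostname -s)" $nodes; then
  echo "this host is a configured compute node; slurmd can start"
else
  echo "this host is not in the node list; slurmd fails with 'Unable to determine this slurmd's NodeName'"
fi
```

This is also why starting slurmd on a headnode that only runs slurmctld fails: the controller's hostname is simply not in the NodeName list.<br>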
<br>
Anyway, I suggest updating the operating system to stretch and fixing your<br>
configuration under a more recent version of Slurm.<br>
Best regards,<br>
<span class="m_1226285621552704478m_-9093742716976473048m_5348135526455917116gmail-m_-4769155202419504202m_1324718862540659117HOEnZb"><font color="#888888">--<br>
Gennaro Oliva<br>
<br>
</font></span></blockquote></div><br></div>
</blockquote></div></div>
</div></div></blockquote></div><br></div></div></div></div>
</blockquote></div><br><br clear="all"><div><br></div></div></div><span class="m_1226285621552704478HOEnZb"><font color="#888888">-- <br><div class="m_1226285621552704478m_-9093742716976473048gmail_signature" data-smartmail="gmail_signature">--<br>Carles Fenoy<br></div>
</font></span></div>
</blockquote></div><br></div>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">--<br>Carles Fenoy<br></div>
</div>