<div dir="auto"><br></div><div dir="auto"><br><br><div dir="auto"><div>Sorry for the delay, was trying to fix it but still not working.</div><div><br></div><div>The node is always down. The master machine is also the compute machine. It's a single server that i use for that. 1 node and 12 cpus.</div><div><br></div><div>In the log below i see this line</div><div><div>[2017-11-30T09:24:41.764] agent/is_node_resp: node:linuxcluster RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure</div><div><br></div></div><div><br></div><div>Here below my slurm.conf file:</div><div><br></div><div><div>ControlMachine=linuxcluster<br></div><div>AuthType=auth/munge</div><div>CryptoType=crypto/munge</div><div>MailProg=/usr/bin/mail</div><div>MpiDefault=none</div><div>PluginDir=/usr/local/lib/slurm</div><div>ProctrackType=proctrack/cgroup</div><div>ReturnToService=1</div><div>SlurmctldPidFile=/var/run/slurmctld.pid</div><div>SlurmctldPort=6817</div><div>SlurmdPidFile=/var/run/slurmd.pid</div><div>SlurmdPort=6818</div><div>SlurmdSpoolDir=/var/spool/slurm/d</div><div>SlurmUser=slurm</div><div>StateSaveLocation=/var/spool/slurm/ctld</div><div>SwitchType=switch/none</div><div>TaskPlugin=task/none</div><div>InactiveLimit=0</div><div>KillWait=30</div><div>MinJobAge=300</div><div>SlurmctldTimeout=120</div><div>SlurmdTimeout=300</div><div>Waittime=0</div><div>FastSchedule=1</div><div>SchedulerType=sched/backfill</div><div>AccountingStorageHost=linuxcluster</div><div>AccountingStorageType=accounting_storage/slurmdbd</div><div>AccountingStorageUser=slurm</div><div>AccountingStoreJobComment=YES</div><div>ClusterName=linuxcluster</div><div>JobCompType=jobcomp/none</div><div>JobCompUser=slurm</div><div>JobAcctGatherFrequency=30</div><div>JobAcctGatherType=jobacct_gather/cgroup</div><div>SlurmctldDebug=5</div><div>SlurmctldLogFile=/var/log/slurm/slurmctrl.log</div><div>SlurmdDebug=5</div><div><br></div><div>SelectType=select/cons_res</div><div>SelectTypeParameters=CR_CPU</div><div>NodeName=linuxcluster CPUs=12</div><div>PartitionName=testq Nodes=linuxclusterDefault=YES MaxTime=INFINITE State=UP</div></div><div><br></div><div><br></div><div>slurmctrld.log:</div><div><div>[2017-11-30T09:24:28.025] debug: Log file re-opened<br></div><div>[2017-11-30T09:24:28.025] debug: sched: slurmctld starting</div><div>[2017-11-30T09:24:28.025] slurmctld version 17.11.0 started on cluster linuxcluster</div><div>[2017-11-30T09:24:28.026] Munge cryptographic signature plugin loaded</div><div>[2017-11-30T09:24:28.026] Consumable Resources (CR) Node Selection plugin loaded with argument 1</div><div>[2017-11-30T09:24:28.026] preempt/none loaded</div><div>[2017-11-30T09:24:28.026] debug: Checkpoint plugin loaded: checkpoint/none</div><div>[2017-11-30T09:24:28.026] debug: AcctGatherEnergy NONE plugin loaded</div><div>[2017-11-30T09:24:28.026] debug: AcctGatherProfile NONE plugin loaded</div><div>[2017-11-30T09:24:28.026] debug: AcctGatherInterconnect NONE plugin loaded</div><div>[2017-11-30T09:24:28.026] debug: AcctGatherFilesystem NONE plugin loaded</div><div>[2017-11-30T09:24:28.026] debug: Job accounting gather cgroup plugin loaded</div><div>[2017-11-30T09:24:28.026] ExtSensors NONE plugin loaded</div><div>[2017-11-30T09:24:28.026] debug: switch NONE plugin loaded</div><div>[2017-11-30T09:24:28.026] debug: power_save module disabled, SuspendTime < 0</div><div>[2017-11-30T09:24:28.026] debug: No backup controller to shutdown</div><div>[2017-11-30T09:24:28.026] Accounting storage SLURMDBD plugin loaded with 
AuthInfo=(null)</div><div>[2017-11-30T09:24:28.027] debug: Munge authentication plugin loaded</div><div>[2017-11-30T09:24:28.030] debug: slurmdbd: Sent PersistInit msg</div><div>[2017-11-30T09:24:28.030] slurmdbd: recovered 0 pending RPCs</div><div>[2017-11-30T09:24:28.429] debug: Reading slurm.conf file: /usr/local/etc/slurm.conf</div><div>[2017-11-30T09:24:28.430] layouts: no layout to initialize</div><div>[2017-11-30T09:24:28.430] topology NONE plugin loaded</div><div>[2017-11-30T09:24:28.430] debug: No DownNodes</div><div>[2017-11-30T09:24:28.435] debug: Log file re-opened</div><div>[2017-11-30T09:24:28.435] sched: Backfill scheduler plugin loaded</div><div>[2017-11-30T09:24:28.435] route default plugin loaded</div><div>[2017-11-30T09:24:28.435] layouts: loading entities/relations information</div><div>[2017-11-30T09:24:28.435] debug: layouts: 1/1 nodes in hash table, rc=0</div><div>[2017-11-30T09:24:28.435] debug: layouts: loading stage 1</div><div>[2017-11-30T09:24:28.435] debug: layouts: loading stage 1.1 (restore state)</div><div>[2017-11-30T09:24:28.435] debug: layouts: loading stage 2</div><div>[2017-11-30T09:24:28.435] debug: layouts: loading stage 3</div><div>[2017-11-30T09:24:28.435] Recovered state of 1 nodes</div><div>[2017-11-30T09:24:28.435] Down nodes: linuxcluster</div><div>[2017-11-30T09:24:28.435] Recovered JobID=15 State=0x4 NodeCnt=0 Assoc=6</div><div>[2017-11-30T09:24:28.435] Recovered information about 1 jobs</div><div>[2017-11-30T09:24:28.435] cons_res: select_p_node_init</div><div>[2017-11-30T09:24:28.436] cons_res: preparing for 1 partitions</div><div>[2017-11-30T09:24:28.436] debug: Updating partition uid access list</div><div>[2017-11-30T09:24:28.436] Recovered state of 0 reservations</div><div>[2017-11-30T09:24:28.436] State of 0 triggers recovered</div><div>[2017-11-30T09:24:28.436] _preserve_plugins: backup_controller not specified</div><div>[2017-11-30T09:24:28.436] cons_res: select_p_reconfigure</div><div>[2017-11-30T09:24:28.436] cons_res: select_p_node_init</div><div>[2017-11-30T09:24:28.436] cons_res: preparing for 1 partitions</div><div>[2017-11-30T09:24:28.436] Running as primary controller</div><div>[2017-11-30T09:24:28.436] debug: No BackupController, not launching heartbeat.</div><div>[2017-11-30T09:24:28.436] Registering slurmctld at port 6817 with slurmdbd.</div><div>[2017-11-30T09:24:28.677] debug: No feds to retrieve from state</div><div>[2017-11-30T09:24:28.757] debug: Priority BASIC plugin loaded</div><div>[2017-11-30T09:24:28.758] No parameter for mcs plugin, default values set</div><div>[2017-11-30T09:24:28.758] mcs: MCSParameters = (null). 
ondemand set.</div><div>[2017-11-30T09:24:28.758] debug: mcs none plugin loaded</div><div>[2017-11-30T09:24:28.758] debug: power_save mode not enabled</div><div>[2017-11-30T09:24:31.761] debug: Spawning registration agent for linuxcluster1 hosts</div></div><div><div>[2017-11-30T09:24:41.764] agent/is_node_resp: node:linuxcluster RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure</div><div>[2017-11-30T09:24:58.435] debug: backfill: beginning</div><div>[2017-11-30T09:24:58.435] debug: backfill: no jobs to backfill</div><div>[2017-11-30T09:25:28.435] debug: backfill: beginning</div><div>[2017-11-30T09:25:28.436] debug: backfill: no jobs to backfill</div><div>[2017-11-30T09:25:28.830] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_sta</div><div>rt=0,sched_min_interval=2</div><div>[2017-11-30T09:25:28.830] debug: sched: Running job scheduler</div><div>[2017-11-30T09:25:58.436] debug: backfill: beginning</div><div>[2017-11-30T09:25:58.436] debug: backfill: no jobs to backfill</div></div><div><br></div><br clear="all"><div>ps -ef | grep slurm</div><div><div>ubuntu@linuxcluster:/home/dvi/$ ps -ef | grep slurm</div><div>slurm 11388 1 0 09:24 ? 00:00:00 /usr/local/sbin/slurmdbd</div><div>slurm 11430 1 0 09:24 ? 00:00:00 /usr/local/sbin/slurmctld</div><div><br></div></div><div>Any idea ?</div></div><div dir="auto"><br></div></div><div dir="auto"><br></div><div dir="auto"><br></div><div dir="auto"><br></div><div><br><div class="gmail_quote"><div>El El mié, 29 nov 2017 a las 18:21, Le Biot, Pierre-Marie <<a href="mailto:pierre-marie.lebiot@hpe.com">pierre-marie.lebiot@hpe.com</a>> escribió:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
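Here is what I plan to check next. A minimal sketch; the slurmd path is an assumption based on where slurmdbd and slurmctld live in the ps output above:

# slurmd does not appear in the ps output, so check it explicitly
# and try starting it in the foreground with verbose debug output:
pgrep -a slurmd || echo "slurmd is not running"
sudo /usr/local/sbin/slurmd -D -vvv

# once slurmd registers, the node may still need a manual resume:
sinfo -R
sudo scontrol update NodeName=linuxcluster State=RESUME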
<div lang="FR" link="blue" vlink="purple">
<div class="m_-2139995522747173503WordSection1">
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d">Hello David,<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d">So linuxcluster is the Head node and also a Compute node ?<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d">Is slurmd running ?<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d">What does /var/log/slurm/slurmd.log say ?<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d">Regards,<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d">Pierre-Marie Le Biot<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><b><span lang="EN-US" style="font-size:11.0pt;font-family:"Calibri",sans-serif">From:</span></b><span lang="EN-US" style="font-size:11.0pt;font-family:"Calibri",sans-serif"> slurm-users [mailto:<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>]
<b>On Behalf Of </b>david vilanova<br>
<b>Sent:</b> Wednesday, November 29, 2017 4:33 PM<br>
<b>To:</b> Slurm User Community List <<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a>><br>
<b>Subject:</b> Re: [slurm-users] slurm conf with single machine with multi cores.<u></u><u></u></span></p></div></div><div lang="FR" link="blue" vlink="purple"><div class="m_-2139995522747173503WordSection1">
<p class="MsoNormal"><u></u> <u></u></p>
<div>
<div>
<p class="MsoNormal"><span style="color:#313131">Hi,<u></u><u></u></span></p>
</div>
<div>
<p class="MsoNormal"><span style="color:#313131">I have updated the slurm.conf as follows:<u></u><u></u></span></p>
</div>
<p class="MsoNormal"><span style="color:#313131"><br clear="all" style="word-spacing:1px">
</span><u></u><u></u></p>
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
NodeName=linuxcluster CPUs=2
PartitionName=testq Nodes=linuxcluster Default=YES MaxTime=INFINITE State=UP
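To double-check the CPU count against the hardware, slurmd -C prints the node configuration that Slurm detects (output shape is approximate, values depend on the machine):

slurmd -C
# prints a line like:
# NodeName=linuxcluster CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=991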
The node in testq still shows as down. Any idea?

Below is the log from the controller:
==> /var/log/slurm/slurmctrl.log <==
[2017-11-29T16:28:30.446] slurmctld version 17.11.0 started on cluster linuxcluster
[2017-11-29T16:28:30.850] error: SelectType specified more than once, latest value used
[2017-11-29T16:28:30.851] layouts: no layout to initialize
[2017-11-29T16:28:30.855] layouts: loading entities/relations information
[2017-11-29T16:28:30.855] Recovered state of 1 nodes
[2017-11-29T16:28:30.855] Down nodes: linuxcluster
[2017-11-29T16:28:30.855] Recovered information about 0 jobs
[2017-11-29T16:28:30.855] cons_res: select_p_node_init
[2017-11-29T16:28:30.855] cons_res: preparing for 1 partitions
[2017-11-29T16:28:30.856] Recovered state of 0 reservations
[2017-11-29T16:28:30.856] _preserve_plugins: backup_controller not specified
[2017-11-29T16:28:30.856] cons_res: select_p_reconfigure
[2017-11-29T16:28:30.856] cons_res: select_p_node_init
[2017-11-29T16:28:30.856] cons_res: preparing for 1 partitions
[2017-11-29T16:28:30.856] Running as primary controller
[2017-11-29T16:28:30.856] Registering slurmctld at port 6817 with slurmdbd.
[2017-11-29T16:28:31.098] No parameter for mcs plugin, default values set
[2017-11-29T16:28:31.098] mcs: MCSParameters = (null). ondemand set.
[2017-11-29T16:29:31.169] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
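The "SelectType specified more than once" error above suggests a leftover duplicate line in the file. A quick way to locate duplicates (the slurm.conf path is an assumption based on my install; adjust as needed):

grep -n -E '^(SelectType|NodeName|PartitionName)' /usr/local/etc/slurm.conf
# an option listed twice means the later value wins, as the error says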
David
On Wed, Nov 29, 2017 at 15:59, Steffen Grunewald <steffen.grunewald@aei.mpg.de> wrote:
<p class="MsoNormal" style="margin-bottom:12.0pt">Hi David,<br>
<br>
On Wed, 2017-11-29 at 14:45:06 +0000, david vilanova wrote:<br>
> Hello,<br>
> I have installed latest 7.11 release and my node is shown as down.<br>
> I hava a single physical server with 12 cores so not sure the conf below is<br>
> correct ?? can you help ??<br>
><br>
> In slurm.conf the node is configure as follows:<br>
><br>
> NodeName=linuxcluster CPUs=1 RealMemory=991 Sockets=12 CoresPerSocket=1<br>
> ThreadsPerCore=1 Feature=local<br>
<br>
12 Sockets? Certainly not... 12 Cores per socket, yes.<br>
(IIRC CPUS shouldn't be specified if the detailed topology is given.<br>
You may try CPUs=12 and drop the details.)<br>
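Something like this, as an untested sketch (RealMemory carried over from your original line; keep or drop as appropriate):

NodeName=linuxcluster CPUs=12 RealMemory=991
PartitionName=testq Nodes=linuxcluster Default=YES MaxTime=INFINITE State=UP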
> PartitionName=testq Nodes=inuxcluster Default=YES MaxTime=INFINITE State=UP
                            ^^ typo?

Cheers,
Steffen