[slurm-users] slurm conf with single machine with multi cores.

Le Biot, Pierre-Marie pierre-marie.lebiot at hpe.com
Thu Nov 30 02:36:22 MST 2017


Hello David,

The slurmd daemon is not running (while slurmctld and slurmdbd are).

slurmd.log (different from slurmctld.log) should contain more information.
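For example, running slurmd in the foreground with extra verbosity usually prints the startup error directly. A rough sketch, assuming the same /usr/local prefix as the other daemons; note that the slurm.conf below does not set SlurmdLogFile, so slurmd messages may end up in syslog rather than /var/log/slurm/slurmd.log:

sudo /usr/local/sbin/slurmd -D -vvv
tail -n 100 /var/log/slurm/slurmd.log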

Regards,
Pierre-Marie Le Biot

From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of david vilanova
Sent: Thursday, November 30, 2017 9:32 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] slurm conf with single machine with multi cores.



Sorry for the delay, I was trying to fix it but it is still not working.

The node is always down. The master machine is also the compute machine: it's a single server with 1 node and 12 CPUs.

In the log below I see this line:
[2017-11-30T09:24:41.764] agent/is_node_resp: node:linuxcluster RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
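That error generally just means slurmctld could not reach a slurmd on the node. A quick check (assuming the SlurmdPort=6818 from the conf below) of whether anything is listening on that port, and of the node's Reason field:

ss -tlnp | grep 6818
scontrol show node linuxcluster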


Below is my slurm.conf file:

ControlMachine=linuxcluster
AuthType=auth/munge
CryptoType=crypto/munge
MailProg=/usr/bin/mail
MpiDefault=none
PluginDir=/usr/local/lib/slurm
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
AccountingStorageHost=linuxcluster
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreJobComment=YES
ClusterName=linuxcluster
JobCompType=jobcomp/none
JobCompUser=slurm
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=5
SlurmctldLogFile=/var/log/slurm/slurmctrl.log
SlurmdDebug=5

SelectType=select/cons_res
SelectTypeParameters=CR_CPU
NodeName=linuxcluster CPUs=12
PartitionName=testq Nodes=linuxcluster Default=YES MaxTime=INFINITE State=UP
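A quick way to double-check that CPUs=12 matches the hardware is slurmd -C, which prints the node parameters slurmd detects on the local machine (sketch assuming the /usr/local prefix used above):

/usr/local/sbin/slurmd -C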


slurmctrl.log:
[2017-11-30T09:24:28.025] debug:  Log file re-opened
[2017-11-30T09:24:28.025] debug:  sched: slurmctld starting
[2017-11-30T09:24:28.025] slurmctld version 17.11.0 started on cluster linuxcluster
[2017-11-30T09:24:28.026] Munge cryptographic signature plugin loaded
[2017-11-30T09:24:28.026] Consumable Resources (CR) Node Selection plugin loaded with argument 1
[2017-11-30T09:24:28.026] preempt/none loaded
[2017-11-30T09:24:28.026] debug:  Checkpoint plugin loaded: checkpoint/none
[2017-11-30T09:24:28.026] debug:  AcctGatherEnergy NONE plugin loaded
[2017-11-30T09:24:28.026] debug:  AcctGatherProfile NONE plugin loaded
[2017-11-30T09:24:28.026] debug:  AcctGatherInterconnect NONE plugin loaded
[2017-11-30T09:24:28.026] debug:  AcctGatherFilesystem NONE plugin loaded
[2017-11-30T09:24:28.026] debug:  Job accounting gather cgroup plugin loaded
[2017-11-30T09:24:28.026] ExtSensors NONE plugin loaded
[2017-11-30T09:24:28.026] debug:  switch NONE plugin loaded
[2017-11-30T09:24:28.026] debug:  power_save module disabled, SuspendTime < 0
[2017-11-30T09:24:28.026] debug:  No backup controller to shutdown
[2017-11-30T09:24:28.026] Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
[2017-11-30T09:24:28.027] debug:  Munge authentication plugin loaded
[2017-11-30T09:24:28.030] debug:  slurmdbd: Sent PersistInit msg
[2017-11-30T09:24:28.030] slurmdbd: recovered 0 pending RPCs
[2017-11-30T09:24:28.429] debug:  Reading slurm.conf file: /usr/local/etc/slurm.conf
[2017-11-30T09:24:28.430] layouts: no layout to initialize
[2017-11-30T09:24:28.430] topology NONE plugin loaded
[2017-11-30T09:24:28.430] debug:  No DownNodes
[2017-11-30T09:24:28.435] debug:  Log file re-opened
[2017-11-30T09:24:28.435] sched: Backfill scheduler plugin loaded
[2017-11-30T09:24:28.435] route default plugin loaded
[2017-11-30T09:24:28.435] layouts: loading entities/relations information
[2017-11-30T09:24:28.435] debug:  layouts: 1/1 nodes in hash table, rc=0
[2017-11-30T09:24:28.435] debug:  layouts: loading stage 1
[2017-11-30T09:24:28.435] debug:  layouts: loading stage 1.1 (restore state)
[2017-11-30T09:24:28.435] debug:  layouts: loading stage 2
[2017-11-30T09:24:28.435] debug:  layouts: loading stage 3
[2017-11-30T09:24:28.435] Recovered state of 1 nodes
[2017-11-30T09:24:28.435] Down nodes: linuxcluster
[2017-11-30T09:24:28.435] Recovered JobID=15 State=0x4 NodeCnt=0 Assoc=6
[2017-11-30T09:24:28.435] Recovered information about 1 jobs
[2017-11-30T09:24:28.435] cons_res: select_p_node_init
[2017-11-30T09:24:28.436] cons_res: preparing for 1 partitions
[2017-11-30T09:24:28.436] debug:  Updating partition uid access list
[2017-11-30T09:24:28.436] Recovered state of 0 reservations
[2017-11-30T09:24:28.436] State of 0 triggers recovered
[2017-11-30T09:24:28.436] _preserve_plugins: backup_controller not specified
[2017-11-30T09:24:28.436] cons_res: select_p_reconfigure
[2017-11-30T09:24:28.436] cons_res: select_p_node_init
[2017-11-30T09:24:28.436] cons_res: preparing for 1 partitions
[2017-11-30T09:24:28.436] Running as primary controller
[2017-11-30T09:24:28.436] debug:  No BackupController, not launching heartbeat.
[2017-11-30T09:24:28.436] Registering slurmctld at port 6817 with slurmdbd.
[2017-11-30T09:24:28.677] debug:  No feds to retrieve from state
[2017-11-30T09:24:28.757] debug:  Priority BASIC plugin loaded
[2017-11-30T09:24:28.758] No parameter for mcs plugin, default values set
[2017-11-30T09:24:28.758] mcs: MCSParameters = (null). ondemand set.
[2017-11-30T09:24:28.758] debug:  mcs none plugin loaded
[2017-11-30T09:24:28.758] debug:  power_save mode not enabled
[2017-11-30T09:24:31.761] debug:  Spawning registration agent for linuxcluster1 hosts
[2017-11-30T09:24:41.764] agent/is_node_resp: node:linuxcluster RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2017-11-30T09:24:58.435] debug:  backfill: beginning
[2017-11-30T09:24:58.435] debug:  backfill: no jobs to backfill
[2017-11-30T09:25:28.435] debug:  backfill: beginning
[2017-11-30T09:25:28.436] debug:  backfill: no jobs to backfill
[2017-11-30T09:25:28.830] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2017-11-30T09:25:28.830] debug:  sched: Running job scheduler
[2017-11-30T09:25:58.436] debug:  backfill: beginning
[2017-11-30T09:25:58.436] debug:  backfill: no jobs to backfill


Output of ps -ef | grep slurm:
ubuntu at linuxcluster:/home/dvi/$         ps -ef | grep slurm
slurm    11388     1  0 09:24 ?        00:00:00 /usr/local/sbin/slurmdbd
slurm    11430     1  0 09:24 ?        00:00:00 /usr/local/sbin/slurmctld

Any idea?
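If slurmd really is missing from that list, presumably just starting it and letting the node re-register would help; with ReturnToService=1 the node should leave the DOWN state once it registers with a valid config, and otherwise it can be resumed by hand. A rough sketch, assuming the same /usr/local prefix and that no systemd unit is installed:

sudo /usr/local/sbin/slurmd
scontrol update NodeName=linuxcluster State=RESUME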





On Wed, Nov 29, 2017 at 18:21, Le Biot, Pierre-Marie <pierre-marie.lebiot at hpe.com> wrote:
Hello David,

So linuxcluster is the Head node and also a Compute node ?

Is slurmd running ?

What does /var/log/slurm/slurmd.log say ?
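A quick way to check both (assuming slurmd logs there; without SlurmdLogFile in slurm.conf the messages may be in syslog instead):

pgrep -a slurmd
tail -n 50 /var/log/slurm/slurmd.log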

Regards,
Pierre-Marie Le Biot


From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of david vilanova
Sent: Wednesday, November 29, 2017 4:33 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] slurm conf with single machine with multi cores.

Hi,
I have updated the slurm.conf as follows:

SelectType=select/cons_res
SelectTypeParameters=CR_CPU
NodeName=linuxcluster CPUs=2
PartitionName=testq Nodes=linuxcluster Default=YES MaxTime=INFINITE State=UP

I still get the testq node in DOWN status. Any idea?

Below is the log from the DB and controller:
==> /var/log/slurm/slurmctrl.log <==
[2017-11-29T16:28:30.446] slurmctld version 17.11.0 started on cluster linuxcluster
[2017-11-29T16:28:30.850] error: SelectType specified more than once, latest value used
[2017-11-29T16:28:30.851] layouts: no layout to initialize
[2017-11-29T16:28:30.855] layouts: loading entities/relations information
[2017-11-29T16:28:30.855] Recovered state of 1 nodes
[2017-11-29T16:28:30.855] Down nodes: linuxcluster
[2017-11-29T16:28:30.855] Recovered information about 0 jobs
[2017-11-29T16:28:30.855] cons_res: select_p_node_init
[2017-11-29T16:28:30.855] cons_res: preparing for 1 partitions
[2017-11-29T16:28:30.856] Recovered state of 0 reservations
[2017-11-29T16:28:30.856] _preserve_plugins: backup_controller not specified
[2017-11-29T16:28:30.856] cons_res: select_p_reconfigure
[2017-11-29T16:28:30.856] cons_res: select_p_node_init
[2017-11-29T16:28:30.856] cons_res: preparing for 1 partitions
[2017-11-29T16:28:30.856] Running as primary controller
[2017-11-29T16:28:30.856] Registering slurmctld at port 6817 with slurmdbd.
[2017-11-29T16:28:31.098] No parameter for mcs plugin, default values set
[2017-11-29T16:28:31.098] mcs: MCSParameters = (null). ondemand set.
[2017-11-29T16:29:31.169] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
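The "error: SelectType specified more than once" line above suggests a leftover duplicate entry in slurm.conf; something like the following (using the conf path shown in the other log, /usr/local/etc/slurm.conf) should locate it:

grep -n SelectType /usr/local/etc/slurm.conf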

David



On Wed, Nov 29, 2017 at 15:59, Steffen Grunewald <steffen.grunewald at aei.mpg.de> wrote:
Hi David,

On Wed, 2017-11-29 at 14:45:06 +0000, david vilanova wrote:
> Hello,
> I have installed the latest 17.11 release and my node is shown as down.
> I have a single physical server with 12 cores, so I'm not sure the conf below is
> correct? Can you help?
>
> In slurm.conf the node is configure as follows:
>
> NodeName=linuxcluster CPUs=1 RealMemory=991 Sockets=12 CoresPerSocket=1
> ThreadsPerCore=1 Feature=local

12 Sockets? Certainly not... 12 Cores per socket, yes.
(IIRC CPUs shouldn't be specified if the detailed topology is given.
You may try CPUs=12 and drop the details.)
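In other words, something like either of these lines should describe a 12-core box (the Sockets/CoresPerSocket split here is just a guess; the actual values should match what slurmd -C or lscpu reports):

NodeName=linuxcluster CPUs=12 RealMemory=991
NodeName=linuxcluster Sockets=1 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=991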

> PartitionName=testq Nodes=inuxcluster Default=YES MaxTime=INFINITE State=UP
                           ^^ typo?

Cheers,
 Steffen