[slurm-users] slurm conf with single machine with multi cores.

david vilanova vilanew at gmail.com
Thu Nov 30 01:31:48 MST 2017


Sorry for the delay; I was trying to fix it but it is still not working.

The node is always down. The master machine is also the compute machine:
it is a single server that I use for this, with 1 node and 12 CPUs.

In the log below I see this line:
[2017-11-30T09:24:41.764] agent/is_node_resp: node:linuxcluster
RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure


Below is my slurm.conf file:

ControlMachine=linuxcluster
AuthType=auth/munge
CryptoType=crypto/munge
MailProg=/usr/bin/mail
MpiDefault=none
PluginDir=/usr/local/lib/slurm
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
AccountingStorageHost=linuxcluster
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreJobComment=YES
ClusterName=linuxcluster
JobCompType=jobcomp/none
JobCompUser=slurm
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=5
SlurmctldLogFile=/var/log/slurm/slurmctrl.log
SlurmdDebug=5

SelectType=select/cons_res
SelectTypeParameters=CR_CPU
NodeName=linuxcluster CPUs=12
PartitionName=testq Nodes=linuxcluster Default=YES MaxTime=INFINITE State=UP
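
As a cross-check, slurmd -C run on the node prints a NodeName= line with the
CPU/socket/core/thread and memory values that slurmd itself detects, which can
be pasted into slurm.conf to avoid a mismatch:

    slurmd -C    # print detected hardware as a ready-to-use NodeName line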


slurmctrl.log:
[2017-11-30T09:24:28.025] debug:  Log file re-opened
[2017-11-30T09:24:28.025] debug:  sched: slurmctld starting
[2017-11-30T09:24:28.025] slurmctld version 17.11.0 started on cluster
linuxcluster
[2017-11-30T09:24:28.026] Munge cryptographic signature plugin loaded
[2017-11-30T09:24:28.026] Consumable Resources (CR) Node Selection plugin
loaded with argument 1
[2017-11-30T09:24:28.026] preempt/none loaded
[2017-11-30T09:24:28.026] debug:  Checkpoint plugin loaded: checkpoint/none
[2017-11-30T09:24:28.026] debug:  AcctGatherEnergy NONE plugin loaded
[2017-11-30T09:24:28.026] debug:  AcctGatherProfile NONE plugin loaded
[2017-11-30T09:24:28.026] debug:  AcctGatherInterconnect NONE plugin loaded
[2017-11-30T09:24:28.026] debug:  AcctGatherFilesystem NONE plugin loaded
[2017-11-30T09:24:28.026] debug:  Job accounting gather cgroup plugin loaded
[2017-11-30T09:24:28.026] ExtSensors NONE plugin loaded
[2017-11-30T09:24:28.026] debug:  switch NONE plugin loaded
[2017-11-30T09:24:28.026] debug:  power_save module disabled, SuspendTime <
0
[2017-11-30T09:24:28.026] debug:  No backup controller to shutdown
[2017-11-30T09:24:28.026] Accounting storage SLURMDBD plugin loaded with
AuthInfo=(null)
[2017-11-30T09:24:28.027] debug:  Munge authentication plugin loaded
[2017-11-30T09:24:28.030] debug:  slurmdbd: Sent PersistInit msg
[2017-11-30T09:24:28.030] slurmdbd: recovered 0 pending RPCs
[2017-11-30T09:24:28.429] debug:  Reading slurm.conf file:
/usr/local/etc/slurm.conf
[2017-11-30T09:24:28.430] layouts: no layout to initialize
[2017-11-30T09:24:28.430] topology NONE plugin loaded
[2017-11-30T09:24:28.430] debug:  No DownNodes
[2017-11-30T09:24:28.435] debug:  Log file re-opened
[2017-11-30T09:24:28.435] sched: Backfill scheduler plugin loaded
[2017-11-30T09:24:28.435] route default plugin loaded
[2017-11-30T09:24:28.435] layouts: loading entities/relations information
[2017-11-30T09:24:28.435] debug:  layouts: 1/1 nodes in hash table, rc=0
[2017-11-30T09:24:28.435] debug:  layouts: loading stage 1
[2017-11-30T09:24:28.435] debug:  layouts: loading stage 1.1 (restore state)
[2017-11-30T09:24:28.435] debug:  layouts: loading stage 2
[2017-11-30T09:24:28.435] debug:  layouts: loading stage 3
[2017-11-30T09:24:28.435] Recovered state of 1 nodes
[2017-11-30T09:24:28.435] Down nodes: linuxcluster
[2017-11-30T09:24:28.435] Recovered JobID=15 State=0x4 NodeCnt=0 Assoc=6
[2017-11-30T09:24:28.435] Recovered information about 1 jobs
[2017-11-30T09:24:28.435] cons_res: select_p_node_init
[2017-11-30T09:24:28.436] cons_res: preparing for 1 partitions
[2017-11-30T09:24:28.436] debug:  Updating partition uid access list
[2017-11-30T09:24:28.436] Recovered state of 0 reservations
[2017-11-30T09:24:28.436] State of 0 triggers recovered
[2017-11-30T09:24:28.436] _preserve_plugins: backup_controller not specified
[2017-11-30T09:24:28.436] cons_res: select_p_reconfigure
[2017-11-30T09:24:28.436] cons_res: select_p_node_init
[2017-11-30T09:24:28.436] cons_res: preparing for 1 partitions
[2017-11-30T09:24:28.436] Running as primary controller
[2017-11-30T09:24:28.436] debug:  No BackupController, not launching
heartbeat.
[2017-11-30T09:24:28.436] Registering slurmctld at port 6817 with slurmdbd.
[2017-11-30T09:24:28.677] debug:  No feds to retrieve from state
[2017-11-30T09:24:28.757] debug:  Priority BASIC plugin loaded
[2017-11-30T09:24:28.758] No parameter for mcs plugin, default values set
[2017-11-30T09:24:28.758] mcs: MCSParameters = (null). ondemand set.
[2017-11-30T09:24:28.758] debug:  mcs none plugin loaded
[2017-11-30T09:24:28.758] debug:  power_save mode not enabled
[2017-11-30T09:24:31.761] debug:  Spawning registration agent for
linuxcluster1 hosts
[2017-11-30T09:24:41.764] agent/is_node_resp: node:linuxcluster
RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2017-11-30T09:24:58.435] debug:  backfill: beginning
[2017-11-30T09:24:58.435] debug:  backfill: no jobs to backfill
[2017-11-30T09:25:28.435] debug:  backfill: beginning
[2017-11-30T09:25:28.436] debug:  backfill: no jobs to backfill
[2017-11-30T09:25:28.830] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2017-11-30T09:25:28.830] debug:  sched: Running job scheduler
[2017-11-30T09:25:58.436] debug:  backfill: beginning
[2017-11-30T09:25:58.436] debug:  backfill: no jobs to backfill


ubuntu@linuxcluster:/home/dvi/$ ps -ef | grep slurm
slurm    11388     1  0 09:24 ?        00:00:00 /usr/local/sbin/slurmdbd
slurm    11430     1  0 09:24 ?        00:00:00 /usr/local/sbin/slurmctld
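
Note that only slurmdbd and slurmctld appear above; no slurmd process is
running on linuxcluster, which would explain the REQUEST_NODE_REGISTRATION_STATUS
failure in the controller log. A rough next step (assuming slurmd is installed
under the same /usr/local prefix as the other daemons) would be:

    sudo /usr/local/sbin/slurmd -D -vvv     # start slurmd in the foreground with verbose logging
    scontrol show node linuxcluster         # once it registers, check the State= and Reason= fields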

Any idea?





On Wed, Nov 29, 2017 at 18:21, Le Biot, Pierre-Marie <pierre-marie.lebiot at hpe.com> wrote:

> Hello David,
>
>
>
> So linuxcluster is the Head node and also a Compute node?
>
>
>
> Is slurmd running?
>
>
>
> What does /var/log/slurm/slurmd.log say?
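>
> For example (assuming the same /usr/local install prefix), something like:
>
>     ps -ef | grep slurmd                    # is the daemon there at all?
>     tail -n 50 /var/log/slurm/slurmd.log    # last startup attempt / error messages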
>
>
>
> Regards,
>
> Pierre-Marie Le Biot
>
>
>
>
>
> From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of david vilanova
> Sent: Wednesday, November 29, 2017 4:33 PM
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] slurm conf with single machine with multi cores.
>
>
>
> Hi,
>
> I have updated the slurm.conf as follows:
>
>
> SelectType=select/cons_res
>
> SelectTypeParameters=CR_CPU
>
> NodeName=linuxcluster CPUs=2
>
> PartitionName=testq Nodes=linuxcluster Default=YES MaxTime=INFINITE
> State=UP
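>
> For reference, changes to slurm.conf only take effect after restarting
> slurmctld/slurmd, or after telling the running daemons to re-read it, e.g.:
>
>     scontrol reconfigure     # re-read slurm.conf on the controller and every slurmd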
>
>
>
> I still get the testq node in down status. Any idea?
>
>
>
> Below log from db and controller:
>
> ==> /var/log/slurm/slurmctrl.log <==
>
> [2017-11-29T16:28:30.446] slurmctld version 17.11.0 started on cluster
> linuxcluster
>
> [2017-11-29T16:28:30.850] error: SelectType specified more than once,
> latest value used
>
> [2017-11-29T16:28:30.851] layouts: no layout to initialize
>
> [2017-11-29T16:28:30.855] layouts: loading entities/relations information
>
> [2017-11-29T16:28:30.855] Recovered state of 1 nodes
>
> [2017-11-29T16:28:30.855] Down nodes: linuxcluster
>
> [2017-11-29T16:28:30.855] Recovered information about 0 jobs
>
> [2017-11-29T16:28:30.855] cons_res: select_p_node_init
>
> [2017-11-29T16:28:30.855] cons_res: preparing for 1 partitions
>
> [2017-11-29T16:28:30.856] Recovered state of 0 reservations
>
> [2017-11-29T16:28:30.856] _preserve_plugins: backup_controller not
> specified
>
> [2017-11-29T16:28:30.856] cons_res: select_p_reconfigure
>
> [2017-11-29T16:28:30.856] cons_res: select_p_node_init
>
> [2017-11-29T16:28:30.856] cons_res: preparing for 1 partitions
>
> [2017-11-29T16:28:30.856] Running as primary controller
>
> [2017-11-29T16:28:30.856] Registering slurmctld at port 6817 with slurmdbd.
>
> [2017-11-29T16:28:31.098] No parameter for mcs plugin, default values set
>
> [2017-11-29T16:28:31.098] mcs: MCSParameters = (null). ondemand set.
>
> [2017-11-29T16:29:31.169]
> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
>
>
>
> David
>
>
>
>
>
>
>
> On Wed, Nov 29, 2017 at 15:59, Steffen Grunewald <steffen.grunewald at aei.mpg.de> wrote:
>
> Hi David,
>
> On Wed, 2017-11-29 at 14:45:06 +0000, david vilanova wrote:
> > Hello,
> > I have installed the latest 17.11 release and my node is shown as down.
> > I have a single physical server with 12 cores, so I am not sure the conf
> > below is correct. Can you help?
> >
> > In slurm.conf the node is configure as follows:
> >
> > NodeName=linuxcluster CPUs=1 RealMemory=991 Sockets=12 CoresPerSocket=1
> > ThreadsPerCore=1 Feature=local
>
> 12 Sockets? Certainly not... 12 Cores per socket, yes.
> (IIRC CPUS shouldn't be specified if the detailed topology is given.
> You may try CPUs=12 and drop the details.)
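>
> A minimal definition along those lines (keeping the name and memory from your
> original line) would be roughly:
>
>     NodeName=linuxcluster CPUs=12 RealMemory=991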
>
> > PartitionName=testq Nodes=inuxcluster Default=YES MaxTime=INFINITE State=UP
>                             ^^ typo?
>
> Cheers,
>  Steffen
>
>

