[slurm-users] Slurm not starting

Elisabetta Falivene e.falivene at ilabroma.com
Mon Jan 15 08:50:01 MST 2018


On the headnode. (I'm also noticing, and it seems worth mentioning since the
problem may be the same, that even LDAP is not working as expected: it gives
the message "invalid credential (49)", which is the message given when there
are problems of this type. The upgrade to jessie must have touched something
that is affecting all my software's sanity :D )
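
(In case it helps, a manual bind should reproduce that error code; the server
URI and bind DN below are just placeholders, not my real values:

  # test a simple bind against the LDAP server; result 49 = invalid credentials
  ldapwhoami -x -H ldap://ldap.example.com -D "cn=admin,dc=example,dc=com" -W
)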

Here is my slurm.conf.

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=anyone
ControlAddr=master
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=openmpi
MpiParams=ports=12000-12999
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=2
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/tmp
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/cgroup
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=60
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=43200
#OverTimeLimit=0
SlurmctldTimeout=600
SlurmdTimeout=600
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
DefMemPerCPU=1000
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
#SchedulerPort=
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
AccountingStorageLoc=/var/log/slurm-llnl/AccountingStorage.log
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/filetxt
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
JobCompLoc=/var/log/slurm-llnl/JobComp.log
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/filetxt
#JobCompUser=
JobAcctGatherFrequency=60
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
PartitionName=batch Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP
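
(Roughly the kind of check I'm doing on the machine where slurmd won't start,
in case it's relevant; node01 below is just an example name, not necessarily
the node in question:

  # slurmd matches the machine's hostname against the NodeName entries in slurm.conf
  hostname -s
  grep -i '^NodeName' /etc/slurm-llnl/slurm.conf

  # run the daemon in the foreground with verbose logging
  slurmd -Dvvv

  # or force a node name explicitly (e.g. node01) to rule out a hostname mismatch
  slurmd -Dvvv -N node01
)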


2018-01-15 16:43 GMT+01:00 Carlos Fenoy <minibit at gmail.com>:

> Are you trying to start the slurmd in the headnode or a compute node?
>
> Can you provide the slurm.conf file?
>
> Regards,
> Carlos
>
> On Mon, Jan 15, 2018 at 4:30 PM, Elisabetta Falivene <
> e.falivene at ilabroma.com> wrote:
>
>> slurmd -Dvvv says
>>
>> slurmd: fatal: Unable to determine this slurmd's NodeName
>>
>> b
>>
>> 2018-01-15 15:58 GMT+01:00 Douglas Jacobsen <dmjacobsen at lbl.gov>:
>>
>>> The fact that sinfo is responding shows that at least slurmctld is
>>> running.  Slurmd, on the other hand, is not.  Please also get the output
>>> of the slurmd log or of running "slurmd -Dvvv"
>>>
>>
>>>
>>> On Jan 15, 2018 06:42, "Elisabetta Falivene" <e.falivene at ilabroma.com>
>>> wrote:
>>>
>>>> > Anyway I suggest updating the operating system to stretch and fixing
>>>> > your configuration under a more recent version of Slurm.
>>>>
>>>> I think I'll get to that soon :)
>>>> b
>>>>
>>>> 2018-01-15 14:08 GMT+01:00 Gennaro Oliva <oliva.g at na.icar.cnr.it>:
>>>>
>>>>> Ciao Elisabetta,
>>>>>
>>>>> On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wrote:
>>>>> > Error messages are not helping me much in figuring out what is going
>>>>> > on. What should I check to see what is failing?
>>>>>
>>>>> Check slurmctld.log and slurmd.log; you can find them under
>>>>> /var/log/slurm-llnl
>>>>>
>>>>> > PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>>>> > batch       up   infinite      8   unk* node[01-08]
>>>>> >
>>>>> >
>>>>> > Running
>>>>> > systemctl status slurmctld.service
>>>>> >
>>>>> > returns
>>>>> >
>>>>> > slurmctld.service - Slurm controller daemon
>>>>> >    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
>>>>> >    Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s ago
>>>>> >   Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>>>>> >
>>>>> > slurmctld[2100]: cons_res: select_p_reconfigure
>>>>> > slurmctld[2100]: cons_res: select_p_node_init
>>>>> > slurmctld[2100]: cons_res: preparing for 1 partitions
>>>>> > slurmctld[2100]: Running as primary controller
>>>>> > slurmctld[2100]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>>>>> > slurmctld.service start operation timed out. Terminating.
>>>>> > Terminate signal (SIGINT or SIGTERM) received
>>>>> > slurmctld[2100]: Saving all slurm state
>>>>> > Failed to start Slurm controller daemon.
>>>>> > Unit slurmctld.service entered failed state.
>>>>>
>>>>> Do you have a backup controller?
>>>>> Check your slurm.conf under:
>>>>> /etc/slurm-llnl
>>>>>
>>>>> Anyway I suggest updating the operating system to stretch and fixing
>>>>> your configuration under a more recent version of Slurm.
>>>>> Best regards
>>>>> --
>>>>> Gennaro Oliva
>>>>>
>>>>>
>>>>
>>
>
>
> --
> --
> Carles Fenoy
>