[slurm-users] Slurm not starting

Carlos Fenoy minibit at gmail.com
Mon Jan 15 08:56:09 MST 2018


Hi,

You cannot start slurmd on the headnode: it is not listed on the NodeName line
of your slurm.conf, so slurmd cannot match the host to a node definition (hence
the "Unable to determine this slurmd's NodeName" error). Try running the same
command on the compute nodes and check the output; if there is any issue, it
should display the reason.
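
For example, a minimal sketch of that check (the node names and the use of ssh
are assumptions based on your slurm.conf; adjust to however you normally reach
the nodes):

    # On the headnode: see which short hostname slurmd would report and
    # compare it against the NodeName entries in slurm.conf.
    hostname -s
    grep '^NodeName' /etc/slurm-llnl/slurm.conf

    # On a compute node (node01..node08 in your config), run slurmd in the
    # foreground, as root, with verbose logging and read what it prints.
    ssh node01 slurmd -Dvvv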

Regards,
Carlos

On Mon, Jan 15, 2018 at 4:50 PM, Elisabetta Falivene <
e.falivene at ilabroma.com> wrote:

> On the headnode. (I'm also noticing, and it seems worth mentioning since the
> problem may be related, that even LDAP is not working as expected: it gives an
> "invalid credential (49)" message, which is typical of this kind of problem.
> The update to jessie must have touched something that is affecting the sanity
> of all my software :D )
>
> Here is my slurm.conf.
>
> # slurm.conf file generated by configurator.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> ControlMachine=anyone
> ControlAddr=master
> #BackupController=
> #BackupAddr=
> #
> AuthType=auth/munge
> CacheGroups=0
> #CheckpointType=checkpoint/none
> CryptoType=crypto/munge
> #DisableRootJobs=NO
> #EnforcePartLimits=NO
> #Epilog=
> #EpilogSlurmctld=
> #FirstJobId=1
> #MaxJobId=999999
> #GresTypes=
> #GroupUpdateForce=0
> #GroupUpdateTime=600
> #JobCheckpointDir=/var/slurm/checkpoint
> #JobCredentialPrivateKey=
> #JobCredentialPublicCertificate=
> #JobFileAppend=0
> #JobRequeue=1
> #JobSubmitPlugins=1
> #KillOnBadExit=0
> #Licenses=foo*4,bar
> #MailProg=/bin/mail
> #MaxJobCount=5000
> #MaxStepCount=40000
> #MaxTasksPerNode=128
> MpiDefault=openmpi
> MpiParams=ports=12000-12999
> #PluginDir=
> #PlugStackConfig=
> #PrivateData=jobs
> ProctrackType=proctrack/cgroup
> #Prolog=
> #PrologSlurmctld=
> #PropagatePrioProcess=0
> #PropagateResourceLimits=
> #PropagateResourceLimitsExcept=
> ReturnToService=2
> #SallocDefaultCommand=
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/tmp/slurmd
> SlurmUser=slurm
> #SlurmdUser=root
> #SrunEpilog=
> #SrunProlog=
> StateSaveLocation=/tmp
> SwitchType=switch/none
> #TaskEpilog=
> TaskPlugin=task/cgroup
> #TaskPluginParam=
> #TaskProlog=
> #TopologyPlugin=topology/tree
> #TmpFs=/tmp
> #TrackWCKey=no
> #TreeWidth=
> #UnkillableStepProgram=
> #UsePAM=0
> #
> #
> # TIMERS
> #BatchStartTimeout=10
> #CompleteWait=0
> #EpilogMsgTime=2000
> #GetEnvTimeout=2
> #HealthCheckInterval=0
> #HealthCheckProgram=
> InactiveLimit=0
> KillWait=60
> #MessageTimeout=10
> #ResvOverRun=0
> MinJobAge=43200
> #OverTimeLimit=0
> SlurmctldTimeout=600
> SlurmdTimeout=600
> #UnkillableStepTimeout=60
> #VSizeFactor=0
> Waittime=0
> #
> #
> # SCHEDULING
> DefMemPerCPU=1000
> FastSchedule=1
> #MaxMemPerCPU=0
> #SchedulerRootFilter=1
> #SchedulerTimeSlice=30
> SchedulerType=sched/backfill
> #SchedulerPort=
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU_Memory
> #
> #
> # JOB PRIORITY
> #PriorityType=priority/basic
> #PriorityDecayHalfLife=
> #PriorityCalcPeriod=
> #PriorityFavorSmall=
> #PriorityMaxAge=
> #PriorityUsageResetPeriod=
> #PriorityWeightAge=
> #PriorityWeightFairshare=
> #PriorityWeightJobSize=
> #PriorityWeightPartition=
> #PriorityWeightQOS=
> #
> #
> # LOGGING AND ACCOUNTING
> #AccountingStorageEnforce=0
> #AccountingStorageHost=
> AccountingStorageLoc=/var/log/slurm-llnl/AccountingStorage.log
> #AccountingStoragePass=
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/filetxt
> #AccountingStorageUser=
> AccountingStoreJobComment=YES
> ClusterName=cluster
> #DebugFlags=
> #JobCompHost=
> JobCompLoc=/var/log/slurm-llnl/JobComp.log
> #JobCompPass=
> #JobCompPort=
> JobCompType=jobcomp/filetxt
> #JobCompUser=
> JobAcctGatherFrequency=60
> JobAcctGatherType=jobacct_gather/linux
> SlurmctldDebug=3
> #SlurmctldLogFile=
> SlurmdDebug=3
> #SlurmdLogFile=
> #SlurmSchedLogFile=
> #SlurmSchedLogLevel=
> #
> #
> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
> #SuspendProgram=
> #ResumeProgram=
> #SuspendTimeout=
> #ResumeTimeout=
> #ResumeRate=
> #SuspendExcNodes=
> #SuspendExcParts=
> #SuspendRate=
> #SuspendTime=
> #
> #
> # COMPUTE NODES
> NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
> PartitionName=batch Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP
>
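
Note that the NodeName line above only covers node[01-08], so a slurmd started
on the master finds no matching node definition, which is exactly what the
"Unable to determine this slurmd's NodeName" error complains about. Only if
slurmd were really meant to run on the headnode as well (normally it is not)
would the headnode need its own entry; a minimal sketch, where the hostname
"master" and the CPUs/RealMemory figures are pure assumptions:

    NodeName=master CPUs=4 RealMemory=8000 State=UNKNOWN   # hypothetical sizing
    PartitionName=batch Nodes=master,node[01-08] Default=YES MaxTime=INFINITE State=UP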
>
> 2018-01-15 16:43 GMT+01:00 Carlos Fenoy <minibit at gmail.com>:
>
>> Are you trying to start the slurmd in the headnode or a compute node?
>>
>> Can you provide the slurm.conf file?
>>
>> Regards,
>> Carlos
>>
>> On Mon, Jan 15, 2018 at 4:30 PM, Elisabetta Falivene <
>> e.falivene at ilabroma.com> wrote:
>>
>>> slurmd -Dvvv says
>>>
>>> slurmd: fatal: Unable to determine this slurmd's NodeName
>>>
>>> b
>>>
>>> 2018-01-15 15:58 GMT+01:00 Douglas Jacobsen <dmjacobsen at lbl.gov>:
>>>
>>>> The fact that sinfo is responding shows that at least slurmctld is
>>>> running.  Slurmd, on the other hand, is not.  Please also get the output
>>>> of the slurmd log or of running "slurmd -Dvvv"
>>>>
>>>> On Jan 15, 2018 06:42, "Elisabetta Falivene" <e.falivene at ilabroma.com>
>>>> wrote:
>>>>
>>>>> > Anyway, I suggest updating the operating system to stretch and fixing
>>>>> > your configuration under a more recent version of Slurm.
>>>>>
>>>>> I think I'll soon get to that :)
>>>>> b
>>>>>
>>>>> 2018-01-15 14:08 GMT+01:00 Gennaro Oliva <oliva.g at na.icar.cnr.it>:
>>>>>
>>>>>> Ciao Elisabetta,
>>>>>>
>>>>>> On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wrote:
>>>>>> > The error messages are not helping me much in working out what is
>>>>>> > going on. What should I check to find out what is failing?
>>>>>>
>>>>>> check slurmctld.log and slurmd.log; you can find them under
>>>>>> /var/log/slurm-llnl
>>>>>>
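
A quick way to read them, as a sketch (the paths follow the Debian slurm-llnl
packaging mentioned above):

    # Last messages from the controller (on the headnode) and from the
    # node daemon (on a compute node).
    tail -n 50 /var/log/slurm-llnl/slurmctld.log
    tail -n 50 /var/log/slurm-llnl/slurmd.log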
>>>>>> > PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>>>>> > batch*       up   infinite      8   unk* node[01-08]
>>>>>> >
>>>>>> >
>>>>>> > Running
>>>>>> > systemctl status slurmctld.service
>>>>>> >
>>>>>> > returns
>>>>>> >
>>>>>> > slurmctld.service - Slurm controller daemon
>>>>>> >    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
>>>>>> >    Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s ago
>>>>>> >   Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>>>>>> >
>>>>>> > slurmctld[2100]: cons_res: select_p_reconfigure
>>>>>> > slurmctld[2100]: cons_res: select_p_node_init
>>>>>> > slurmctld[2100]: cons_res: preparing for 1 partitions
>>>>>> > slurmctld[2100]: Running as primary controller
>>>>>> > slurmctld[2100]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>>>>>> > slurmctld.service start operation timed out. Terminating.
>>>>>> > Terminate signal (SIGINT or SIGTERM) received
>>>>>> > slurmctld[2100]: Saving all slurm state
>>>>>> > Failed to start Slurm controller daemon.
>>>>>> > Unit slurmctld.service entered failed state.
>>>>>>
>>>>>> Do you have a backup controller?
>>>>>> Check your slurm.conf under:
>>>>>> /etc/slurm-llnl
>>>>>>
>>>>>> Anyway, I suggest updating the operating system to stretch and fixing
>>>>>> your configuration under a more recent version of Slurm.
>>>>>> Best regards
>>>>>> --
>>>>>> Gennaro Oliva
>>>>>>
>>>>>>
>>>>>
>>>
>>
>>
>> --
>> --
>> Carles Fenoy
>>
>
>


-- 
--
Carles Fenoy