[slurm-users] Slurm not starting

Mon Jan 15 08:57:14 MST 2018

Googling a bit, the error "slurmd: fatal: Unable to determine this slurmd's
NodeName" comes up when you try to run slurmd on the master, which
shouldn't execute slurmd(?). It must be up on the nodes, not on the master.
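
To illustrate why that would happen, here is a rough Python sketch of the
idea as I understand it: slurmd looks for its own hostname among the NodeName
entries in slurm.conf and gives up if it is not there. This is not how slurmd
itself is implemented. The sample config is made up: the "master" hostname
and the CPUs value are assumptions, and only node[01-08] and the batch
partition come from the sinfo output quoted below. The helper names are
hypothetical, and the real lookup handles more cases (NodeHostname, full
hostlist syntax).

```python
import re

# Hypothetical slurm.conf fragment, for illustration only.
SAMPLE_CONF = """
ControlMachine=master
NodeName=node[01-08] CPUs=16 State=UNKNOWN
PartitionName=batch Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP
"""


def split_top_level(value):
    """Split a host list on commas that are not inside [...]."""
    return re.findall(r"(?:[^,\[]|\[[^\]]*\])+", value)


def expand_hostlist(expr):
    """Expand a simple expression such as 'node[01-08,10]'.

    Only one bracket group with numbers or ranges is handled here;
    real Slurm hostlists allow more than this.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        return [expr]  # plain hostname, nothing to expand
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo
        width = len(lo)  # keep zero padding, e.g. 01..08
        for n in range(int(lo), int(hi) + 1):
            hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


def configured_node_names(conf_text):
    """Collect every host named on a NodeName= line."""
    names = []
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        match = re.match(r"NodeName=(\S+)", line, flags=re.IGNORECASE)
        if not match:
            continue
        for expr in split_top_level(match.group(1)):
            if expr.upper() != "DEFAULT":  # NodeName=DEFAULT sets defaults only
                names.extend(expand_hostlist(expr))
    return names


def host_is_configured(hostname, conf_text):
    """True if the short hostname appears among the NodeName entries."""
    short = hostname.split(".", 1)[0]
    return short in configured_node_names(conf_text)


if __name__ == "__main__":
    print(host_is_configured("master", SAMPLE_CONF))
    print(host_is_configured("node03", SAMPLE_CONF))
```

With this sample config the master is not among the NodeName entries while
node03 is, which matches the idea that slurmd has nothing to identify itself
with on the master. On a real machine you could pass socket.gethostname()
and the contents of your own slurm.conf instead.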

2018-01-15 16:50 GMT+01:00 Douglas Jacobsen <dmjacobsen at lbl.gov>:

> Please check your slurm.conf on the compute nodes; I'm thinking that your
> compute node isn't appearing in slurm.conf properly.
>
> On Jan 15, 2018 07:45, "John Hearns" <hearnsj at googlemail.com> wrote:
>
>> That's it. I am calling JohnH's Law:
>> "Any problem with a batch queueing system is due to hostname resolution"
>>
>>
>> On 15 January 2018 at 16:30, Elisabetta Falivene <e.falivene at ilabroma.com> wrote:
>>
>>> slurmd -Dvvv says
>>>
>>> slurmd: fatal: Unable to determine this slurmd's NodeName
>>>
>>> b
>>>
>>> 2018-01-15 15:58 GMT+01:00 Douglas Jacobsen <dmjacobsen at lbl.gov>:
>>>
>>>> The fact that sinfo is responding shows that at least slurmctld is
>>>> running.  slurmd, on the other hand, is not.  Please also get the output
>>>> of the slurmd log or of running "slurmd -Dvvv"
>>>>
>>>> On Jan 15, 2018 06:42, "Elisabetta Falivene" <e.falivene at ilabroma.com>
>>>> wrote:
>>>>
>>>>> > Anyway, I suggest updating the operating system to stretch and fixing
>>>>> > your configuration under a more recent version of Slurm.
>>>>>
>>>>> I think I'll get to that soon :)
>>>>> b
>>>>>
>>>>> 2018-01-15 14:08 GMT+01:00 Gennaro Oliva <oliva.g at na.icar.cnr.it>:
>>>>>
>>>>>> Ciao Elisabetta,
>>>>>>
>>>>>> On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wrote:
>>>>>> > The error messages are not helping me much in guessing what is going
>>>>>> > on. What should I check to find out what is failing?
>>>>>>
>>>>>> Check slurmctld.log and slurmd.log; you can find them under
>>>>>> /var/log/slurm-llnl
>>>>>>
>>>>>> > PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>>>>> > batch*       up   infinite      8   unk* node[01-08]
>>>>>> >
>>>>>> >
>>>>>> > Running
>>>>>> > systemctl status slurmctld.service
>>>>>> >
>>>>>> > returns
>>>>>> >
>>>>>> > slurmctld.service - Slurm controller daemon
>>>>>> >    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
>>>>>> >    Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s ago
>>>>>> >   Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>>>>>> >
>>>>>> > slurmctld[2100]: cons_res: select_p_reconfigure
>>>>>> > slurmctld[2100]: cons_res: select_p_node_init
>>>>>> > slurmctld[2100]: cons_res: preparing for 1 partitions
>>>>>> > slurmctld[2100]: Running as primary controller
>>>>>> > slurmctld[2100]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>>>>>> > slurmctld.service start operation timed out. Terminating.
>>>>>> > Terminate signal (SIGINT or SIGTERM) received
>>>>>> > slurmctld[2100]: Saving all slurm state
>>>>>> > Failed to start Slurm controller daemon.
>>>>>> > Unit slurmctld.service entered failed state.
>>>>>>
>>>>>> Do you have a backup controller?
>>>>>> Check your slurm.conf under:
>>>>>> /etc/slurm-llnl
>>>>>>
>>>>>> Anyway, I suggest updating the operating system to stretch and fixing
>>>>>> your configuration under a more recent version of Slurm.
>>>>>> Best regards
>>>>>> --
>>>>>> Gennaro Oliva
>>>>>>
>>>>>>
>>>>>
>>>
>>