[slurm-users] Slurm not starting

Douglas Jacobsen dmjacobsen at lbl.gov
Mon Jan 15 08:50:40 MST 2018


Please check your slurm.conf on the compute nodes; I suspect that your
compute node isn't appearing in slurm.conf properly.
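
For example, slurmd identifies itself by matching the node's own hostname
(from "hostname -s") against the NodeName entries in slurm.conf, so for the
nodes in your sinfo output it needs something like this (the CPUs value is a
placeholder for your real hardware):

    # NodeName must match "hostname -s" on each compute node
    NodeName=node[01-08] CPUs=8 State=UNKNOWN
    PartitionName=batch Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP

If the hostname doesn't match any NodeName entry, slurmd dies with exactly
the "Unable to determine this slurmd's NodeName" error you are seeing.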

On Jan 15, 2018 07:45, "John Hearns" <hearnsj at googlemail.com> wrote:

> That's it. I am calling JohnH's Law:
> "Any problem with a batch queueing system is due to hostname resolution"
>
>
> On 15 January 2018 at 16:30, Elisabetta Falivene <e.falivene at ilabroma.com>
> wrote:
>
>> slurmd -Dvvv says
>>
>> slurmd: fatal: Unable to determine this slurmd's NodeName
>>
>> b
>>
>> 2018-01-15 15:58 GMT+01:00 Douglas Jacobsen <dmjacobsen at lbl.gov>:
>>
>>> The fact that sinfo is responding shows that at least slurmctld is
>>> running. slurmd, on the other hand, is not. Please also get the output of
>>> the slurmd log or of running "slurmd -Dvvv"
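>>>
>>> For example, as root on one of the compute nodes:
>>>
>>>     slurmd -Dvvv
>>>
>>> (-D keeps slurmd in the foreground instead of daemonizing; each -v
>>> increases the logging verbosity.)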
>>>
>>> On Jan 15, 2018 06:42, "Elisabetta Falivene" <e.falivene at ilabroma.com>
>>> wrote:
>>>
>>>> > Anyway, I suggest updating the operating system to stretch and fixing
>>>> > your configuration under a more recent version of Slurm.
>>>>
>>>> I think I'll get to that soon :)
>>>> b
>>>>
>>>> 2018-01-15 14:08 GMT+01:00 Gennaro Oliva <oliva.g at na.icar.cnr.it>:
>>>>
>>>>> Ciao Elisabetta,
>>>>>
>>>>> On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wrote:
>>>>> > Error messages are not helping me much in figuring out what is going
>>>>> > on. What should I check to find out what is failing?
>>>>>
>>>>> Check slurmctld.log and slurmd.log; you can find them under
>>>>> /var/log/slurm-llnl
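>>>>>
>>>>> For example:
>>>>>
>>>>>     tail -n 50 /var/log/slurm-llnl/slurmctld.log   # on the head node
>>>>>     tail -n 50 /var/log/slurm-llnl/slurmd.log      # on each compute node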
>>>>>
>>>>> > PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>>>> > batch        up   infinite      8   unk* node[01-08]
>>>>> >
>>>>> >
>>>>> > Running
>>>>> > systemctl status slurmctld.service
>>>>> >
>>>>> > returns
>>>>> >
>>>>> > slurmctld.service - Slurm controller daemon
>>>>> >    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
>>>>> >    Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s ago
>>>>> >   Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>>>>> >
>>>>> > slurmctld[2100]: cons_res: select_p_reconfigure
>>>>> > slurmctld[2100]: cons_res: select_p_node_init
>>>>> > slurmctld[2100]: cons_res: preparing for 1 partitions
>>>>> > slurmctld[2100]: Running as primary controller
>>>>> > slurmctld[2100]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>>>>> > slurmctld.service start operation timed out. Terminating.
>>>>> > Terminate signal (SIGINT or SIGTERM) received
>>>>> > slurmctld[2100]: Saving all slurm state
>>>>> > Failed to start Slurm controller daemon.
>>>>> > Unit slurmctld.service entered failed state.
>>>>>
>>>>> Do you have a backup controller?
>>>>> Check your slurm.conf under:
>>>>> /etc/slurm-llnl
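>>>>>
>>>>> For example, the controller lines in a slurm.conf of this vintage look
>>>>> something like this (hostnames here are placeholders; BackupController
>>>>> is only set if you actually have a second controller):
>>>>>
>>>>>     ControlMachine=master         # placeholder hostname
>>>>>     BackupController=master2      # placeholder hostname, optional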
>>>>>
>>>>> Anyway, I suggest updating the operating system to stretch and fixing
>>>>> your configuration under a more recent version of Slurm.
>>>>> Best regards
>>>>> --
>>>>> Gennaro Oliva
>>>>>
>>>>>