[slurm-users] Slurm not starting

John Hearns hearnsj at googlemail.com
Mon Jan 15 08:35:30 MST 2018


That's it. I am calling JohnH's Law:
"Any problem with a batch queueing system is due to hostname resolution"

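In this thread the law shows up as slurmd failing with "Unable to determine this slurmd's NodeName", which typically means the short hostname of the machine does not match any NodeName (or NodeHostname) entry in slurm.conf. A minimal Python sketch of that comparison, using the node[01-08] range from the sinfo output quoted below. The helper names are made up for illustration, and the expansion only covers a single simple bracket range, not Slurm's full hostlist syntax ("scontrol show hostnames" does the real expansion):

```python
import re
import socket


def expand_hostlist(expr):
    """Expand a simple expression such as 'node[01-08]' into host names.

    Assumption: at most one bracket group holding comma-separated
    numbers or ranges. Slurm's real hostlist syntax is richer.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        return [expr]  # plain name, nothing to expand
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        low, _, high = part.partition("-")
        high = high or low  # a single number is a range of one
        width = len(low)    # keep zero padding, e.g. 01 stays 01
        for n in range(int(low), int(high) + 1):
            hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


def short_hostname(name=None):
    """Return the host name up to the first dot (like 'hostname -s')."""
    name = socket.gethostname() if name is None else name
    return name.split(".")[0]


def is_configured_node(hostname, node_expr):
    """True if the short host name is one of the expanded node names."""
    return short_hostname(hostname) in expand_hostlist(node_expr)


if __name__ == "__main__":
    # Checks the local machine against the range seen in this thread.
    print(short_hostname(), is_configured_node(None, "node[01-08]"))
```

If this prints False on a compute node, the name the node reports and the names in slurm.conf disagree, which is the first thing to look at for this error.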

On 15 January 2018 at 16:30, Elisabetta Falivene <e.falivene at ilabroma.com>
wrote:

> slurmd -Dvvv says
>
> slurmd: fatal: Unable to determine this slurmd's NodeName
>
> b
>
> 2018-01-15 15:58 GMT+01:00 Douglas Jacobsen <dmjacobsen at lbl.gov>:
>
>> The fact that sinfo is responding shows that at least slurmctld is
>> running.  slurmd, on the other hand, is not.  Please also get the output
>> of the slurmd log or of running "slurmd -Dvvv"
>>
>> On Jan 15, 2018 06:42, "Elisabetta Falivene" <e.falivene at ilabroma.com>
>> wrote:
>>
>>> > Anyway, I suggest updating the operating system to stretch and fixing
>>> > your configuration under a more recent version of Slurm.
>>>
>>> I think I'll get to that soon :)
>>> b
>>>
>>> 2018-01-15 14:08 GMT+01:00 Gennaro Oliva <oliva.g at na.icar.cnr.it>:
>>>
>>>> Ciao Elisabetta,
>>>>
>>>> On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wrote:
>>>> > The error messages are not helping me much in guessing what is going on.
>>>> > What should I check to find out what is failing?
>>>>
>>>> Check slurmctld.log and slurmd.log; you can find them under
>>>> /var/log/slurm-llnl
>>>>
>>>> > PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>>> > batch*       up   infinite      8   unk* node[01-08]
>>>> >
>>>> >
>>>> > Running
>>>> > systemctl status slurmctld.service
>>>> >
>>>> > returns
>>>> >
>>>> > slurmctld.service - Slurm controller daemon
>>>> >    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
>>>> >    Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s ago
>>>> >   Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>>>> >
>>>> > slurmctld[2100]: cons_res: select_p_reconfigure
>>>> > slurmctld[2100]: cons_res: select_p_node_init
>>>> > slurmctld[2100]: cons_res: preparing for 1 partitions
>>>> > slurmctld[2100]: Running as primary controller
>>>> > slurmctld[2100]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>>>> > slurmctld.service start operation timed out. Terminating.
>>>> > Terminate signal (SIGINT or SIGTERM) received
>>>> > slurmctld[2100]: Saving all slurm state
>>>> > Failed to start Slurm controller daemon.
>>>> > Unit slurmctld.service entered failed state.
>>>>
>>>> Do you have a backup controller?
>>>> Check your slurm.conf under:
>>>> /etc/slurm-llnl
>>>>
>>>> Anyway, I suggest updating the operating system to stretch and fixing
>>>> your configuration under a more recent version of Slurm.
>>>> Best regards
>>>> --
>>>> Gennaro Oliva
>>>>
>>>>
>>>
>