[slurm-users] Slurm not starting

Elisabetta Falivene e.falivene at ilabroma.com
Mon Jan 15 05:13:27 MST 2018


I upgraded a cluster of 8 nodes (all up, running and reachable) from
wheezy to jessie (with a normal dist-upgrade), which also took slurm from
2.3.4 to 14.03.9. After overcoming some problems booting the kernel (thank
you very much to Gennaro Oliva, btw), the system now runs correctly with
kernel 3.16.0.4, but slurm isn't starting. I tried restarting the
services, but they fail to start.

The error messages don't help me much in working out what is going on.
What should I check to find out what is failing?

Thank you
Elisabetta

PS: Here are some tests I ran

Running
*sinfo*

returns

*PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST*
*batch*       up   infinite      8   unk* node[01-08]*


Running
*systemctl status slurmctld.service*

returns

*slurmctld.service - Slurm controller daemon*
*   Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)*
*   Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s ago*
*  Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)*

* slurmctld[2100]: cons_res: select_p_reconfigure*
* slurmctld[2100]: cons_res: select_p_node_init*
* slurmctld[2100]: cons_res: preparing for 1 partitions*
* slurmctld[2100]: Running as primary controller*
* slurmctld[2100]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0*
* slurmctld.service start operation timed out. Terminating.*
*Terminate signal (SIGINT or SIGTERM) received*
* slurmctld[2100]: Saving all slurm state*
* Failed to start Slurm controller daemon.*
* Unit slurmctld.service entered failed state.*

and running

*/etc/init.d/slurmd status*

returns

*slurmd.service - Slurm node daemon*
*   Loaded: loaded (/lib/systemd/system/slurmd.service; enabled)*
*   Active: failed (Result: exit-code) since Mon 2018-01-15 12:44:52 CET; 21min ago*
*  Process: 729 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)*

* slurmd.service: control process exited, code=exited status=1*
*systemd[1]: Failed to start Slurm node daemon.*
*Unit slurmd.service entered failed state.*