[slurm-users] unable to start slurmd process.

Jeffrey T Frey frey at udel.edu
Thu Jun 11 13:14:12 UTC 2020


Is the time on that node too far out-of-sync w.r.t. the slurmctld server?
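
A minimal sketch of how that could be checked, assuming the cluster authenticates with MUNGE and that its default credential lifetime of roughly 300 seconds is the relevant limit (both are assumptions about this site, not something shown in the thread). The function names and the host name "ctld-host" are placeholders made up for illustration.

```shell
#!/bin/sh
# Sketch only: compare two epoch timestamps and report the clock skew.
# The 300-second default limit is an assumption (MUNGE's usual credential
# lifetime); adjust it to match your site.

# abs_skew LOCAL_EPOCH REMOTE_EPOCH -> prints the absolute difference in seconds
abs_skew() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    echo "$diff"
}

# check_skew LOCAL_EPOCH REMOTE_EPOCH [LIMIT] -> returns non-zero if skew > LIMIT
check_skew() {
    skew=$(abs_skew "$1" "$2")
    limit=${3:-300}
    if [ "$skew" -gt "$limit" ]; then
        echo "clock skew ${skew}s exceeds ${limit}s"
        return 1
    fi
    echo "clock skew ${skew}s is within ${limit}s"
}

# Example call on the compute node ("ctld-host" is a hypothetical name for
# the machine running slurmctld):
#   check_skew "$(date +%s)" "$(ssh ctld-host date +%s)"
```

The ssh round trip adds a little latency, so a difference of a second or two is noise; only a skew of minutes would be worth chasing.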



> On Jun 11, 2020, at 09:01 , navin srivastava <navin.altair at gmail.com> wrote:
> 
> I tried running it in debug mode, but it does not write anything there either.
> 
> I waited for about 5-10 minutes.
> 
> deda1x1452:/etc/sysconfig # /usr/sbin/slurmd -v -v
> 
> No output on terminal. 
> 
> The OS is SLES12-SP4. All firewall services are disabled.
> 
> The recent change is the hostname: earlier the nodes used local hostnames (node1, node2, etc.), but we have moved to a DNS-based hostname, which is deda
> 
> NodeName=node[1-12] NodeHostname=deda1x[1450-1461] NodeAddr=node[1-12] Sockets=2 CoresPerSocket=10 State=UNKNOWN
> Other than this, everything is fine. After that change I started the slurmd process on the node several times and it worked fine, but today I am seeing this issue.
> 
> Regards
> Navin.
> 
> On Thu, Jun 11, 2020 at 6:06 PM Riebs, Andy <andy.riebs at hpe.com> wrote:
> Navin,
> 
>  
> 
> As you can see, systemd provides very little service-specific information. For slurm, you really need to go to the slurm logs to find out what happened.
> 
>  
> 
> Hint: A quick way to identify problems like this with slurmd and slurmctld is to run them with the “-Dvvv” option, causing them to log to your window, and usually causing the problem to become immediately obvious.
> 
>  
> 
> For example,
> 
>  
> 
> # /usr/local/slurm/sbin/slurmd -Dvvvv
> 
>  
> 
> Just hit ^C when you’re done, if necessary. Of course, if it doesn’t fail when you run it this way, it’s time to look elsewhere.
> 
>  
> 
> Andy
> 
>  
> 
> From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of navin srivastava
> Sent: Thursday, June 11, 2020 8:25 AM
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: [slurm-users] unable to start slurmd process.
> 
>  
> 
> Hi Team,
> 
>  
> 
> When I try to start the slurmd process, I get the error below.
> 
>  
> 
> 2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon...
> 2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start operation timed out. Terminating.
> 2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm node daemon.
> 2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit entered failed state.
> 2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed with result 'timeout'.
> 2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]: pam_unix(crond:session): session opened for user root by (uid=0)
> 
>  
> 
> Slurm version is 17.11.8
> 
>  
> 
> The server and Slurm have been running for a long time and we have not made any changes, but today when I start it, it gives this error message.
> 
> Any idea what could be wrong here?
> 
>  
> 
> Regards
> 
> Navin.