[slurm-users] unable to start slurmd process.

navin srivastava navin.altair at gmail.com
Sat Jun 13 05:21:19 UTC 2020


Hi Team,

After my analysis, I found that the user had used the qdel command
(one of the Torque compatibility wrappers shipped with Slurm) and the
job was not killed properly, leaving its slurmstepd processes in a
hung state. That is why slurmd would not start. After killing those
hung processes, slurmd started without any issues.
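
For reference, a minimal sketch of how such leftover step daemons can
be found and cleared (standard process names assumed; the forced kill
is a last resort for daemons that ignore a normal TERM):

# list any slurmstepd processes still running on the node
ps -ef | grep [s]lurmstepd

# force-kill the hung step daemons, then start slurmd again
pkill -9 slurmstepd
systemctl start slurmd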

Regards
Navin.




On Thu, Jun 11, 2020 at 9:23 PM Riebs, Andy <andy.riebs at hpe.com> wrote:

> Short of getting on the system and kicking the tires myself, I’m fresh out
> of ideas. Does “sinfo -R” offer any hints?
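>
> For reference, "sinfo -R" lists down and drained nodes together with
> the reason slurmctld recorded for them; a typical invocation (the
> node filter is optional):
>
> sinfo -R --nodes=oled3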
>
>
>
> *From:* slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] *On
> Behalf Of *navin srivastava
> *Sent:* Thursday, June 11, 2020 11:31 AM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] unable to start slurmd process.
>
>
>
> I am able to get the output of "scontrol show node oled3".
>
> Also, oled3 is pinging fine,
>
>
>
> and the "scontrol ping" output shows:
>
>
>
> Slurmctld(primary/backup) at deda1x1466/(NULL) are UP/DOWN
>
>
>
> So all looks OK to me.
>
>
>
> Regards
>
> Navin.
>
>
> On Thu, Jun 11, 2020 at 8:38 PM Riebs, Andy <andy.riebs at hpe.com> wrote:
>
> So there seems to be a failure to communicate between slurmctld and the
> oled3 slurmd.
>
>
>
> From oled3, try “scontrol ping” to confirm that it can see the slurmctld
> daemon.
>
>
>
> From the head node, try “scontrol show node oled3”, and then ping the
> address that is shown for “NodeAddr=”
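>
> Something like the following (node names as used in this thread; the
> address placeholder is whatever the previous command reports):
>
> # on oled3: can the node reach slurmctld?
> scontrol ping
>
> # on the head node: which address does slurmctld use for oled3?
> scontrol show node oled3 | grep NodeAddr
> ping -c 3 <NodeAddr-from-above>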
>
>
>
> *From:* slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] *On
> Behalf Of *navin srivastava
> *Sent:* Thursday, June 11, 2020 10:40 AM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] unable to start slurmd process.
>
>
>
> I collected the log from slurmctld, and it says the following:
>
>
>
> [2020-06-10T20:10:38.501] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T20:14:38.901] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T20:18:38.255] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T20:22:38.624] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T20:26:38.902] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T20:30:38.230] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T20:34:38.594] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T20:38:38.986] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T20:42:38.402] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T20:46:38.764] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T20:50:38.094] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T21:26:38.839] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T21:30:38.225] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T21:34:38.582] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T21:38:38.914] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T21:42:38.292] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T21:46:38.542] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T21:50:38.869] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T21:54:38.227] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-10T21:58:38.628] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-11T06:54:39.012] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-11T06:58:39.411] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-11T07:02:39.106] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-11T07:06:39.495] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-11T07:10:39.814] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-11T07:14:39.188] Resending TERMINATE_JOB request JobId=1252284
> Nodelist=oled3
> [2020-06-11T07:14:49.204] agent/is_node_resp: node:oled3
> RPC:REQUEST_TERMINATE_JOB : Communication connection failure
> [2020-06-11T07:14:50.210] error: Nodes oled3 not responding
> [2020-06-11T07:15:54.313] error: Nodes oled3 not responding
> [2020-06-11T07:17:34.407] error: Nodes oled3 not responding
> [2020-06-11T07:19:14.637] error: Nodes oled3 not responding
> [2020-06-11T07:19:54.313] update_node: node oled3 reason set to:
> reboot-required
> [2020-06-11T07:19:54.313] update_node: node oled3 state set to DRAINING*
> [2020-06-11T07:20:43.788] requeue job 1316970 due to failure of node oled3
> [2020-06-11T07:20:43.788] requeue job 1349322 due to failure of node oled3
> [2020-06-11T07:20:43.789] error: Nodes oled3 not responding, setting DOWN
>
>
>
> sinfo says:
>
>
>
> OLED*           up   infinite      1 drain* oled3
>
>
>
> While checking the node itself, I feel the node is healthy.
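>
> For reference, the trailing "*" in the sinfo state means the node is
> not responding. Once slurmd is up and responding again, a drained
> node can usually be returned to service with:
>
> scontrol update NodeName=oled3 State=RESUME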
>
>
>
> Regards
>
> Navin
>
>
>
> On Thu, Jun 11, 2020 at 7:21 PM Riebs, Andy <andy.riebs at hpe.com> wrote:
>
> Weird. “slurmd -Dvvv” ought to report a whole lot of data; I can’t guess
> how to interpret it not reporting anything but the “log file” and “munge”
> messages. When you have it running attached to your window, is there any
> chance that sinfo or scontrol suggest that the node is actually all right?
> Perhaps something in /etc/sysconfig/slurm or the like is messed up?
>
>
>
> If that’s not the case, I think my next step would be to follow up on
> someone else’s suggestion, and scan the slurmctld.log file for the problem
> node name.
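>
> For example, assuming the log lives at the path below (check with
> "scontrol show config | grep -i log"):
>
> grep oled3 /var/log/slurm/slurmctld.log | tail -50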
>
>
>
> *From:* slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] *On
> Behalf Of *navin srivastava
> *Sent:* Thursday, June 11, 2020 9:26 AM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] unable to start slurmd process.
>
>
>
> Sorry, Andy, I missed adding this.
>
> First I tried "slurmd -Dvvv", and it did not write anything beyond:
>
> slurmd: debug:  Log file re-opened
> slurmd: debug:  Munge authentication plugin loaded
>
>
>
> After that I waited for 10-20 minutes, but there was no further
> output, and finally I pressed Ctrl-C.
>
>
>
> My doubt is about the slurm.conf file:
>
>
>
> ControlMachine=deda1x1466
> ControlAddr=192.168.150.253
>
>
>
> deda1x1466 also has a different interface with a different IP, which
> the compute node is unable to ping, although the ControlAddr IP
> itself is pingable.
>
> Could this be one of the reasons?
>
>
>
> But other nodes have the same config, and there I am able to start
> slurmd, so I am a bit confused.
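>
> One way to narrow it down, assuming the default SlurmctldPort of 6817
> (confirm with "scontrol show config | grep -i port") and that nc is
> installed, would be to test the control address and port from oled3:
>
> ping -c 3 192.168.150.253
> nc -zv 192.168.150.253 6817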
>
>
>
> Regards
>
> Navin.
>
>
> On Thu, Jun 11, 2020 at 6:44 PM Riebs, Andy <andy.riebs at hpe.com> wrote:
>
> If you omitted the “-D” that I suggested, then the daemon would have
> detached and logged nothing on the screen. In this case, you can still go
> to the slurmd log (use "scontrol show config | grep -i log" if you're
> not sure where the logs are stored).
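>
> For example (the tail path below is illustrative; use whatever the
> grep reports):
>
> scontrol show config | grep -i log
> tail -n 50 /var/log/slurm/slurmd.log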
>
>
>
> *From:* slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] *On
> Behalf Of *navin srivastava
> *Sent:* Thursday, June 11, 2020 9:01 AM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] unable to start slurmd process.
>
>
>
> I tried executing it in debug mode, but there it also does not write
> anything.
>
>
>
> I waited for about 5-10 minutes.
>
>
>
> deda1x1452:/etc/sysconfig # /usr/sbin/slurmd -v -v
>
> No output on terminal.
>
>
>
> The OS is SLES12-SP4. All firewall services are disabled.
>
>
>
> The recent change is the hostnames: earlier we used local hostnames
> (node1, node2, etc.), but we have moved to DNS-based hostnames with
> the deda prefix.
>
>
>
> NodeName=node[1-12] NodeHostname=deda1x[1450-1461] NodeAddr=node[1-12]
> Sockets=2 CoresPerSocket=10 State=UNKNOWN
>
> Other than this it is fine; since that change I have started the
> slurmd process on the node several times and it worked fine, but now
> I am seeing this issue today.
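>
> Given that change, it may be worth confirming that the configured
> names still resolve consistently on both the node and the controller
> (the names below are taken from the config line above):
>
> # do NodeAddr and NodeHostname still resolve, and to the same host?
> getent hosts node1 deda1x1450
> hostname -s    # on the node; should match its NodeHostname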
>
>
>
> Regards
>
> Navin.
>
>
> On Thu, Jun 11, 2020 at 6:06 PM Riebs, Andy <andy.riebs at hpe.com> wrote:
>
> Navin,
>
>
>
> As you can see, systemd provides very little service-specific information.
> For slurm, you really need to go to the slurm logs to find out what
> happened.
>
>
>
> Hint: A quick way to identify problems like this with slurmd and slurmctld
> is to run them with the “-Dvvv” option, causing them to log to your window,
> and usually causing the problem to become immediately obvious.
>
>
>
> For example,
>
>
>
> # /usr/local/slurm/sbin/slurmd -Dvvvv
>
>
>
> Just hit ^C when you're done, if necessary. Of course, if it doesn't
> fail when you run it this way, it's time to look elsewhere.
>
>
>
> Andy
>
>
>
> *From:* slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] *On
> Behalf Of *navin srivastava
> *Sent:* Thursday, June 11, 2020 8:25 AM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* [slurm-users] unable to start slurmd process.
>
>
>
> Hi Team,
>
>
>
> When I am trying to start the slurmd process, I am getting the below
> error:
>
>
>
> 2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node
> daemon...
> 2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start
> operation timed out. Terminating.
> 2020-06-11T13:13:28.684479+02:00 oled3 systemd[1]: Failed to start Slurm
> node daemon.
> 2020-06-11T13:13:28.684759+02:00 oled3 systemd[1]: slurmd.service: Unit
> entered failed state.
> 2020-06-11T13:13:28.684917+02:00 oled3 systemd[1]: slurmd.service: Failed
> with result 'timeout'.
> 2020-06-11T13:15:01.437172+02:00 oled3 cron[8094]:
> pam_unix(crond:session): session opened for user root by (uid=0)
>
>
>
> The Slurm version is 17.11.8.
>
>
>
> The server and Slurm have been running for a long time and we have
> not made any changes, but today when I start it, it gives this error
> message.
>
> Any idea what could be wrong here?
>
>
>
> Regards
>
> Navin.