<div dir="ltr"><div class="gmail_default" style="font-family:monospace">We're moving to Prometheus for lots of our monitoring functions. We've got Nagios and Ganglia in place, but Prometheus and Grafana make a really nice combo for monitoring and alerting.</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">There's even an exporter for Slurm, <a href="https://github.com/vpenso/prometheus-slurm-exporter">https://github.com/vpenso/prometheus-slurm-exporter</a>, that includes node data, job information, and scheduling statistics. Haven't had a chance to install that yet, but I expect we'll be doing that soon: monitoring scheduler performance is one area we need to watch a little more closely.</div><div class="gmail_default" style="font-family:monospace"><br></div><div class="gmail_default" style="font-family:monospace">Michael</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Jan 18, 2018 at 1:34 PM, Lachlan Musicman <span dir="ltr"><<a href="mailto:datakid@gmail.com" target="_blank">datakid@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class="">On 19 January 2018 at 07:29, Ryan Novosielski <span dir="ltr"><<a href="mailto:novosirj@rutgers.edu" target="_blank">novosirj@rutgers.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi all,<br>
<br>
Looked back at the mailing list to see if there was a question about this already. There was some mention of /using/ Nagios, but no real mention of specifics. What do people monitor with Nagios? We monitor, so far, slurmctld, slurmdbd, and MySQL, but there are probably some others. Might be helpful to run “scontrol ping” for example, or similar, on our login nodes.<br>
<br>
Does anyone have any plugins they’ve written or ideas they can share? Nagios Exchange doesn’t have anything with SLURM anywhere in the name.<br>
<br>
Thanks!<br></blockquote><div><br><br></div></span><div>Off the top of my head, the only other two that I would explicitly want are:<br></div><div> - ntp/chrony and their respective daemons (ntpd/chronyd). Nodes go offline when their clocks drift too far, especially if you are using Munge.<br></div><div> - authentication system - in our case ipa/sssd. Without that, even the queued jobs will fail.<br></div><div><br></div><div>We use Zabbix in house. I was under the impression that people were moving toward Icinga 2 over Nagios. <br></div><div><br></div><div>Cheers<br></div><div>L.<br></div><div><br clear="all">------<br>"The antidote to apocalypticism is
<b>apocalyptic civics</b>. Apocalyptic civics is the
insistence that we cannot ignore the truth, nor should we panic about
it. It is a shared consciousness that our institutions have failed and
our ecosystem is collapsing, yet we are still here — and we are creative
agents who can shape our destinies. Apocalyptic civics is the
conviction that the only way out is through, and the only way through is
together. "<br><br><i>Greg Bloom</i> @greggish <a href="https://twitter.com/greggish/status/873177525903609857" target="_blank">https://twitter.com/greggish/<wbr>status/873177525903609857</a> <br></div></div></div></div>
</blockquote></div><br></div>
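<div>The "scontrol ping" idea mentioned in the thread could be wrapped as a Nagios-style check roughly like the sketch below. This is only an illustration, not an existing plugin: the function name classify and the WARNING-when-only-some-controllers-respond policy are made up for the example, and it assumes that "scontrol ping" reports each controller with the word UP or DOWN (for example "Slurmctld(primary) at ctl1 is UP"), which may differ between Slurm versions. The exit codes 0/1/2/3 follow the usual Nagios plugin convention (OK, WARNING, CRITICAL, UNKNOWN).</div>

```python
#!/usr/bin/env python3
"""Sketch of a Nagios-style check around `scontrol ping` (illustrative only)."""
import re
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(ping_output):
    """Map `scontrol ping` text to a (exit_code, message) pair.

    Assumption: every controller is reported with the word UP or DOWN,
    either one per line or combined as e.g. "UP/DOWN".
    """
    states = re.findall(r"\b(UP|DOWN)\b", ping_output)
    if not states:
        return UNKNOWN, "UNKNOWN - could not parse scontrol ping output"
    up = states.count("UP")
    total = len(states)
    if up == total:
        return OK, f"OK - {up}/{total} slurmctld responding"
    if up > 0:
        # Hypothetical policy: at least one controller still answers.
        return WARNING, f"WARNING - only {up}/{total} slurmctld responding"
    return CRITICAL, f"CRITICAL - 0/{total} slurmctld responding"


def main():
    try:
        result = subprocess.run(
            ["scontrol", "ping"],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except FileNotFoundError:
        print("UNKNOWN - scontrol not found in PATH")
        return UNKNOWN
    except subprocess.TimeoutExpired:
        print("CRITICAL - scontrol ping timed out")
        return CRITICAL
    # Classify on the text itself rather than the return code.
    code, message = classify(result.stdout)
    print(message)
    return code


if __name__ == "__main__":
    sys.exit(main())
```

<div>Run from a login node (directly or via NRPE), this would catch the case where the node itself can no longer reach slurmctld, which a process check on the controller host would miss.</div>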