[slurm-users] Nagios or Other Monitoring Plugins

Michael Gutteridge michael.gutteridge at gmail.com
Thu Jan 18 19:51:35 MST 2018


We're moving to Prometheus for lots of our monitoring functions.  We've got
Nagios and Ganglia in place, but Prometheus and Grafana make a really nice
combo for monitoring and alerting.

There's even an exporter for Slurm
(https://github.com/vpenso/prometheus-slurm-exporter) that includes node
data, job information, and scheduling statistics.  I haven't had a chance to
install it yet, but I expect we'll be doing that soon: monitoring
scheduler performance is one area we need to watch a little more closely.
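
For anyone staying on Nagios in the meantime, the "scontrol ping" idea from
the original question (quoted below) could be wrapped in a small plugin.
What follows is only a minimal sketch, not something we run in production.
It assumes that "scontrol ping" reports each controller with the word UP or
DOWN and lists the primary first; the exact wording differs between Slurm
versions, so check it against your own output. The names evaluate_ping and
main are made up for illustration.

```python
#!/usr/bin/env python3
"""Nagios-style check wrapping `scontrol ping` (illustrative sketch)."""
import re
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_ping(output):
    """Map `scontrol ping` text to an (exit_code, message) pair.

    Assumption: every controller state appears as the word UP or DOWN,
    with the primary controller reported first.
    """
    states = re.findall(r"\b(UP|DOWN)\b", output)
    if not states:
        return UNKNOWN, "UNKNOWN: could not parse scontrol ping output"
    if states[0] == "DOWN":
        return CRITICAL, "CRITICAL: primary slurmctld is DOWN"
    if "DOWN" in states[1:]:
        return WARNING, "WARNING: a backup slurmctld is DOWN"
    return OK, "OK: slurmctld is UP"


def main():
    """Run `scontrol ping`, print one status line, return the exit code."""
    try:
        result = subprocess.run(
            ["scontrol", "ping"],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"UNKNOWN: could not run scontrol: {exc}")
        return UNKNOWN
    code, message = evaluate_ping(result.stdout)
    print(message)
    return code
```

Installed as a plugin on a login node, the script would finish with
sys.exit(main()) so that Nagios sees the exit code. Keeping the parsing in
its own function makes it easy to test against sample output without a
running cluster.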

Michael

On Thu, Jan 18, 2018 at 1:34 PM, Lachlan Musicman <datakid at gmail.com> wrote:

> On 19 January 2018 at 07:29, Ryan Novosielski <novosirj at rutgers.edu>
> wrote:
>
>> Hi all,
>>
>> Looked back at the mailing list to see if there was a question about this
>> already. There was some mention of /using/ Nagios, but no real mention of
>> specifics. What do people monitor with Nagios? We monitor, so far,
>> slurmctld, slurmdbd, and MySQL, but there are probably some others. Might
>> be helpful to run “scontrol ping” for example, or similar, on our login
>> nodes.
>>
>> Does anyone have any plugins they’ve written or ideas they can share?
>> Nagios Exchange doesn’t have anything with SLURM anywhere in the name.
>>
>> Thanks!
>>
>
>
> Off the top of my head the only other two that I would want explicitly
> would be:
>  - ntp/chrony and their respective daemons. Nodes go offline when the clock
> drifts too far, especially if you are using Munge.
>  - authentication system - in our case ipa/sssd. Without that, even the
> queued jobs will fail.
>
> We use Zabbix in-house. I was under the impression that people were moving
> toward Icinga 2 over Nagios.
>
> Cheers
> L.
>
> ------
> "The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic
> civics is the insistence that we cannot ignore the truth, nor should we
> panic about it. It is a shared consciousness that our institutions have
> failed and our ecosystem is collapsing, yet we are still here — and we are
> creative agents who can shape our destinies. Apocalyptic civics is the
> conviction that the only way out is through, and the only way through is
> together. "
>
> *Greg Bloom* @greggish
> https://twitter.com/greggish/status/873177525903609857
>