[slurm-users] Nagios or Other Monitoring Plugins
Marcin Stolarek
stolarek.marcin at gmail.com
Fri Jan 19 00:56:54 MST 2018
We're using Icinga2, storing accounting data in InfluxDB for Grafana
dashboards. In terms of monitoring I prefer to check end-user functionality,
so apart from individual services we also have a plugin that submits a job
to the cluster (to idle nodes, with a deadline of a few minutes). The job
simply creates files on the shared filesystem, effectively monitoring
slurmctld, slurmd, sssd, filesystems, etc.
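
A check of this kind could be sketched roughly as follows. This is only an
illustration under stated assumptions, not the plugin described above: the
marker directory /shared/monitoring, the job name, the thresholds and all
function names are hypothetical, and selecting idle nodes is left out. It
relies on the sbatch options --parsable, --time, --deadline, --output and
--wrap, and on the usual Nagios/Icinga exit codes (0 OK, 1 WARNING,
2 CRITICAL, 3 UNKNOWN).

```python
import os
import shlex
import subprocess
import time
import uuid

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def build_sbatch_command(marker_path, deadline_minutes=5):
    """Return an sbatch command whose job only touches marker_path."""
    return [
        "sbatch",
        "--parsable",                      # print just the job ID
        "--job-name=monitoring-probe",     # hypothetical job name
        "--time=1",                        # one-minute run limit
        # Ask Slurm to drop the job if it cannot finish before the deadline.
        f"--deadline=now+{deadline_minutes}minutes",
        "--output=/dev/null",
        f"--wrap=touch {shlex.quote(marker_path)}",
    ]


def wait_for_marker(marker_path, timeout, poll=5.0):
    """Poll for the marker file; return seconds waited, or None on timeout."""
    start = time.monotonic()
    while True:
        if os.path.exists(marker_path):
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(poll)


def classify(elapsed, warn_after):
    """Map the observed wait time to a plugin exit code and status line."""
    if elapsed is None:
        return CRITICAL, "CRITICAL - probe job did not create its marker file"
    if elapsed > warn_after:
        return WARNING, f"WARNING - probe job took {elapsed:.0f}s"
    return OK, f"OK - probe job completed in {elapsed:.0f}s"


def main(shared_dir="/shared/monitoring", timeout=300, warn_after=120):
    """Submit the probe job, wait for its marker, print a status line."""
    # shared_dir is an assumed path on the shared filesystem.
    marker = os.path.join(shared_dir, f"probe-{uuid.uuid4().hex}")
    try:
        subprocess.run(build_sbatch_command(marker), check=True,
                       capture_output=True, text=True, timeout=30)
    except FileNotFoundError:
        print("UNKNOWN - sbatch not found on this host")
        return UNKNOWN
    except subprocess.SubprocessError as exc:
        # Submission failed or hung, e.g. slurmctld is unreachable.
        print(f"CRITICAL - could not submit probe job: {exc}")
        return CRITICAL
    code, message = classify(wait_for_marker(marker, timeout), warn_after)
    if os.path.exists(marker):
        os.remove(marker)  # clean up so markers do not accumulate
    print(message)
    return code
```

To run it as a plugin, the script would end with sys.exit(main()) (after
importing sys) so the monitoring system sees the exit code. Because the
marker is written by the job and read by the monitoring host, one check
passes only if submission, scheduling, job launch, user lookup on the node
and the shared filesystem all work.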
cheers,
Marcin
2018-01-19 5:44 GMT+01:00 Ryan Novosielski <novosirj at rutgers.edu>:
> > On Jan 18, 2018, at 4:34 PM, Lachlan Musicman <datakid at gmail.com> wrote:
> >
> > On 19 January 2018 at 07:29, Ryan Novosielski <novosirj at rutgers.edu>
> wrote:
> > Hi all,
> >
> > Looked back at the mailing list to see if there was a question about
> this already. There was some mention of /using/ Nagios, but no real mention
> of specifics. What do people monitor with Nagios? We monitor, so far,
> slurmctld, slurmdbd, and MySQL, but there are probably some others. Might
> be helpful to run “scontrol ping” for example, or similar, on our login
> nodes.
> >
> > Does anyone have any plugins they’ve written or ideas they can share?
> Nagios Exchange doesn’t have anything with SLURM anywhere in the name.
> >
> > Thanks!
> >
> >
> > Off the top of my head the only other two that I would want explicitly
> would be:
> > - ntp/chrony and their respective ntpd. Nodes go offline when the
> timing slides too far, especially if you are using Munge.
> > - authentication system - in our case ipa/sssd. Without that, even the
> queued jobs will fail.
> >
> > We use Zabbix in house. I was under the impression that people were
> moving toward Icinga2 over Nagios.
>
> I wouldn’t mind moving to Icinga2 over Nagios, but really, it’s more or
> less a nicer version of the same thing, so I’d have the same question with
> Icinga2.
>
> Thanks for the NTP/Chrony tip though — if I get only that from this
> thread, it will have been worth it. That’s caused us trouble more than
> once. We do already monitor our LDAP, but SSSD is a good idea.
>
> --
> Ryan Novosielski - novosirj at rutgers.edu
> Sr. Technologist - 973/972.0922 (2x0922) - RBHS Campus
> Office of Advanced Research Computing - MSB C630, Newark
> Rutgers, the State University of NJ
>