[slurm-users] monitoring and update regime for Power Saving nodes

Brian Andrus toomuchit at gmail.com
Wed Feb 23 15:27:28 UTC 2022


David,

For monitoring, I use a combination of netdata and prometheus. Data is
gathered whenever the nodes are up and stored for history. Yes, there
are empty gaps while the nodes are powered down, but a gap is simply
interpreted as the node being powered down.
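
As a rough illustration of reading gaps as power-down periods, here is a
minimal Python sketch. It is not netdata or prometheus code; the function
name, the plain list of sample timestamps, and the "twice the scrape
interval" threshold are all assumptions made for the example.

```python
from typing import List, Tuple


def find_power_down_gaps(
    sample_times: List[float],
    scrape_interval: float,
    tolerance: float = 2.0,
) -> List[Tuple[float, float]]:
    """Return (last_seen, next_seen) pairs for every gap in the samples.

    A gap is any pair of consecutive samples further apart than
    tolerance * scrape_interval (an assumed threshold, not a standard one).
    """
    threshold = scrape_interval * tolerance
    ordered = sorted(sample_times)
    gaps = []
    # Compare each sample with the one that follows it.
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > threshold:
            gaps.append((previous, current))
    return gaps


# Samples every 15 s, with the node absent between t=30 and t=300.
print(find_power_down_gaps([0, 15, 30, 300, 315], scrape_interval=15))
```

In practice you would probably cross-check such gaps against the node
state Slurm reports, so that a crashed node is not mistaken for one that
was powered down on purpose.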

For the config, I have no access to DNS for configless, so I use a
symlink to the slurm.conf file on a shared filesystem. This works great.
Any time there are changes, a simple 'scontrol reconfigure' brings all
running nodes up to speed, and any down nodes will automatically read
the latest version when they next start.
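
A minimal Python sketch of that setup, for illustration only. The helper
names and both paths are hypothetical (not taken from my actual
configuration); the only Slurm command used is 'scontrol reconfigure'.

```python
import os
import subprocess


def link_shared_conf(shared_conf: str, local_conf: str) -> None:
    """Make local_conf a symlink to shared_conf, replacing what is there."""
    tmp_link = local_conf + ".tmp"
    # Clear a stale temporary link left behind by an interrupted run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(shared_conf, tmp_link)
    # Renaming over the target swaps the link in a single step on POSIX.
    os.replace(tmp_link, local_conf)


def reconfigure() -> None:
    """Ask the running Slurm daemons to re-read their configuration."""
    subprocess.run(["scontrol", "reconfigure"], check=True)


if __name__ == "__main__":
    # Hypothetical paths: adjust to the shared filesystem in use.
    link_shared_conf("/shared/slurm/slurm.conf", "/etc/slurm/slurm.conf")
    reconfigure()
```

The link only has to be created once per node (for example in the node
image); after that, editing the shared file and running 'scontrol
reconfigure' is all that is needed.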

Brian Andrus

On 2/23/2022 2:31 AM, David Simpson wrote:
>
> Hi all,
>
> Interested to know what common approaches were to:
>
>   * Monitoring of power saving nodes (e.g. health of the node), when
>     potentially the monitoring system will see it go up and down. Do
>     you limit to BMC only monitoring/health?
>   * When you want to make changes to slurm.conf (or anything else) to
>     a node which is down due to power saving (during a
>     maintenance/reservation) what is your approach? Do you end up with
>     2 slurm.confs (one for power saving and one that keeps everything
>     up, to work on during the maintenance)?
>
>
> thanks
> David
>
> -------------
>
> David Simpson - Senior Systems Engineer
>
> ARCCA, Redwood Building,
>
> King Edward VII Avenue,
>
> Cardiff, CF10 3NB
>
> David Simpson - peiriannydd uwch systemau
>
> ARCCA, Adeilad Redwood,
>
> King Edward VII Avenue,
>
> Caerdydd, CF10 3NB
>
> simpsond4 at cardiff.ac.uk <mailto:simpsond4 at cardiff.ac.uk>
>
> +44 29208 74657
>