[slurm-users] monitoring and update regime for Power Saving nodes

Tina Friedrich tina.friedrich at it.ox.ac.uk
Thu Feb 24 09:42:35 UTC 2022


Hi David,

it's also not actually a problem if the slurm.conf is not exactly the 
same immediately on boot - really. Unless the changes are very 
fundamental, nothing bad will happen if the nodes pick up a new copy 
after, say, 5 or 10 minutes.

But it should be possible to - for example - force a run of your config 
management on startup (or before SLURM startup)?
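
For example, something along these lines - just a sketch, and the unit 
drop-in name, repository and playbook are purely illustrative, so they 
would need adapting to whatever your config management actually looks like:

   # /etc/systemd/system/slurmd.service.d/sync-config.conf
   [Service]
   # Pull the current cluster config before slurmd itself starts.
   # 'ansible-pull', the repo URL and playbook name are illustrative only.
   ExecStartPre=/usr/bin/ansible-pull -U https://git.example.org/cluster-config.git slurm.yml

(baked into the node image, plus a 'systemctl daemon-reload' so the 
drop-in is picked up).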

(Not many ideas about the Nagios check, unless you change it to 
something that interrogates SLURM about node states, or keep some other 
record somewhere that it can consult for nodes that are meant to be down.)
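
If you did change it to ask SLURM, the check itself could be fairly 
small - a rough sketch only, with the state handling and exit codes 
almost certainly needing tuning for your setup:

   #!/bin/bash
   # Nagios-style check: a node powered down by SLURM power saving is OK,
   # anything genuinely down/drained is critical.
   node="$1"
   state=$(sinfo -h -n "$node" -o "%t" | head -n1)
   case "$state" in
     *~*)                echo "OK - $node powered down (power saving)"; exit 0 ;;
     down*|drain*|drng*) echo "CRITICAL - $node state is $state"; exit 2 ;;
     *)                  echo "OK - $node state is $state"; exit 0 ;;
   esac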

Tina

On 24/02/2022 09:20, David Simpson wrote:
> Hi Brian,
> 
>  >>For monitoring, I use a combination of netdata+prometheus. Data is 
> gathered whenever the nodes are up and stored for history. Yes, when the 
> nodes are powered down, there are empty gaps, but that is interpreted as 
> the node being powered down.
> 
> Ah, a time-series approach will cope much better - at the moment our 
> monitoring system (for compute node health at least) is Nagios-like, hence 
> the problem. Though there’s a chance the entire cluster’s stack may change 
> at some point, in which case this problem will be easier to deal with (with 
> a change of monitoring system for node health).
> 
>  >>For the config, I have no access to DNS for configless so I use a 
> symlink to the slurm.conf file on a shared filesystem. This works great. 
> Anytime there are changes, a simple 'scontrol reconfigure' brings all 
> running nodes up to speed and any down nodes will automatically read the 
> latest.
> 
> Yes, currently we use a file-based config, written to the compute 
> nodes’ own disks via ansible. Perhaps we will consider moving the 
> file to a shared fs.
> 
> regards
> David
> 
> -------------
> 
> David Simpson - Senior Systems Engineer
> 
> ARCCA, Redwood Building,
> 
> King Edward VII Avenue,
> 
> Cardiff, CF10 3NB
> 
> David Simpson - peiriannydd uwch systemau
> 
> ARCCA, Adeilad Redwood,
> 
> King Edward VII Avenue,
> 
> Caerdydd, CF10 3NB
> 
> simpsond4 at cardiff.ac.uk <mailto:simpsond4 at cardiff.ac.uk>
> 
> +44 29208 74657
> 
> *From:*slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf Of 
> *Brian Andrus
> *Sent:* 23 February 2022 15:27
> *To:* slurm-users at lists.schedmd.com
> *Subject:* Re: [slurm-users] monitoring and update regime for Power 
> Saving nodes
> 
> David,
> 
> For monitoring, I use a combination of netdata+prometheus. Data is 
> gathered whenever the nodes are up and stored for history. Yes, when the 
> nodes are powered down, there are empty gaps, but that is interpreted as 
> the node being powered down.
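> 
> (The scrape side is nothing special, by the way - one way to wire it 
> up, with the hostnames and netdata's default port purely illustrative:
> 
>    scrape_configs:
>      - job_name: 'netdata'
>        metrics_path: '/api/v1/allmetrics'
>        params:
>          format: [prometheus]
>        static_configs:
>          - targets: ['node001:19999', 'node002:19999']
> 
> Powered-down nodes simply stop answering the scrape, which is what 
> shows up as the gaps mentioned above.)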
> 
> For the config, I have no access to DNS for configless so I use a 
> symlink to the slurm.conf file on a shared filesystem. This works great. 
> Anytime there are changes, a simple 'scontrol reconfigure' brings all 
> running nodes up to speed and any down nodes will automatically read the 
> latest.
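> 
> In other words, roughly (paths illustrative - wherever your shared 
> filesystem lives):
> 
>    # on the controller and in every compute node image
>    ln -sf /shared/slurm/etc/slurm.conf /etc/slurm/slurm.conf
> 
>    # after editing the shared copy, tell the running daemons about it
>    scontrol reconfigure
> 
> Nodes that are powered down just read the new file when they next boot.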
> 
> Brian Andrus
> 
> On 2/23/2022 2:31 AM, David Simpson wrote:
> 
>     Hi all,
> 
>     Interested to know what the common approaches are to:
> 
> 
>      1. Monitoring of power saving nodes (e.g. health of the node), given
>         that the monitoring system will potentially see them go up and
>         down. Do you limit yourselves to BMC-only monitoring/health?
>      2. When you want to make changes to slurm.conf (or anything else)
>         on a node which is down due to power saving (during a
>         maintenance/reservation), what is your approach? Do you end up
>         with 2 slurm.confs (one for power saving and one that keeps
>         everything up, to work on during the maintenance)?
> 
> 
>     thanks
>     David
> 
> 
>     -------------
> 
>     David Simpson - Senior Systems Engineer
> 
>     ARCCA, Redwood Building,
> 
>     King Edward VII Avenue,
> 
>     Cardiff, CF10 3NB
> 
>     David Simpson - peiriannydd uwch systemau
> 
>     ARCCA, Adeilad Redwood,
> 
>     King Edward VII Avenue,
> 
>     Caerdydd, CF10 3NB
> 
>     simpsond4 at cardiff.ac.uk <mailto:simpsond4 at cardiff.ac.uk>
> 
>     +44 29208 74657
> 

-- 
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk


