[slurm-users] monitoring and update regime for Power Saving nodes
Hermann Schwärzler
hermann.schwaerzler at uibk.ac.at
Thu Feb 24 10:26:40 UTC 2022
Hi everybody,
to force a run of your config management, as Tina suggested, you might
just add an
ExecStartPre=
line to your slurmd.service file.
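A minimal sketch of that idea, assuming Ansible in pull mode (the
repository URL and the playbook name are placeholders for whatever your
configuration management actually uses):

   [Service]
   ...
   # fetch and apply this node's configuration before slurmd starts
   # (repository URL and playbook name below are placeholders)
   ExecStartPre=/usr/bin/ansible-pull -U https://git.example.org/site/ansible.git local.yml

Putting the line into a drop-in under
/etc/systemd/system/slurmd.service.d/ (followed by a
"systemctl daemon-reload") has the same effect without editing the
packaged unit file.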
This is somewhat unrelated to your problem, but we are very successfully
using
ExecStartPre=-/usr/bin/nvidia-smi -L
in our slurmd.service file to make sure that all GPU devices are
visible and available on our GPU nodes *before* slurmd starts. Note that
the dash right after the "=" is important: it makes systemd ignore
potential errors when running that command.
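If you prefer not to touch the packaged slurmd.service, the same line
can also go into a systemd drop-in (the path below is only an example;
any *.conf under slurmd.service.d/ works), activated with a
"systemctl daemon-reload":

   # /etc/systemd/system/slurmd.service.d/10-gpu-check.conf  (example path)
   [Service]
   # enumerate the GPUs once so they are visible before slurmd starts;
   # the leading "-" makes systemd ignore a non-zero exit status
   ExecStartPre=-/usr/bin/nvidia-smi -L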
Hermann
On 2/24/22 10:42 AM, Tina Friedrich wrote:
> Hi David,
>
> it's also not actually a problem if the slurm.conf is not exactly the
> same immediately on boot - really. Unless there are changes that are very
> fundamental, nothing bad will happen if the nodes pick up a new copy after,
> say, 5 or 10 minutes.
>
> But it should be possible to - for example - force a run of your config
> management on startup (or before SLURM startup)?
>
> (Not many ideas about the Nagios check, unless you change it to
> something that interrogates SLURM about node states, or keep some other
> record somewhere that it can consult for nodes that are meant to be down.)
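> A rough, untested sketch of the first option, as something a check
> script could do before its normal probes (the node name is assumed to
> arrive as the first argument; adapt to your Nagios plugin conventions):
>
>    #!/bin/bash
>    # nodes powered down by Slurm's power saving carry a "~" suffix in sinfo
>    state=$(sinfo -h -n "$1" -o '%t')
>    case "$state" in
>      *~) echo "OK - $1 is powered down by Slurm"; exit 0 ;;
>    esac
>    # ...otherwise fall through to the usual health checks here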
>
> Tina
>
> On 24/02/2022 09:20, David Simpson wrote:
>> Hi Brian,
>>
>> >>For monitoring, I use a combination of netdata+prometheus. Data is
>> gathered whenever the nodes are up and stored for history. Yes, when
>> the nodes are powered down, there are empty gaps, but that is
>> interpreted as the node being powered down.
>>
>> Ah, a time-series approach will cope much better - at the moment our
>> monitoring system (for compute node health at least) is nagios-like,
>> hence the problem. Though there’s a chance the entire cluster’s stack
>> may change at some point, in which case this problem will be easier to
>> deal with (with a change of monitoring system for node health).
>>
>> >>For the config, I have no access to DNS for configless, so I use a
>> symlink to the slurm.conf file on a shared filesystem. This works great.
>> Anytime there are changes, a simple 'scontrol reconfigure' brings all
>> running nodes up to speed and any down nodes will automatically read
>> the latest.
>>
>> Yes, currently we use a file-based setup, with the config written to
>> the compute nodes’ own disks via ansible. Perhaps we will consider
>> moving the file to a shared fs.
>>
>> regards
>> David
>>
>> -------------
>>
>> David Simpson - Senior Systems Engineer
>>
>> ARCCA, Redwood Building,
>>
>> King Edward VII Avenue,
>>
>> Cardiff, CF10 3NB
>>
>> David Simpson - peiriannydd uwch systemau
>>
>> ARCCA, Adeilad Redwood,
>>
>> King Edward VII Avenue,
>>
>> Caerdydd, CF10 3NB
>>
>> simpsond4 at cardiff.ac.uk <mailto:simpsond4 at cardiff.ac.uk>
>>
>> +44 29208 74657
>>
>> *From:*slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf
>> Of *Brian Andrus
>> *Sent:* 23 February 2022 15:27
>> *To:* slurm-users at lists.schedmd.com
>> *Subject:* Re: [slurm-users] monitoring and update regime for Power
>> Saving nodes
>>
>> David,
>>
>> For monitoring, I use a combination of netdata+prometheus. Data is
>> gathered whenever the nodes are up and stored for history. Yes, when
>> the nodes are powered down, there are empty gaps, but that is
>> interpreted as the node being powered down.
>>
>> For the config, I have no access to DNS for configless, so I use a
>> symlink to the slurm.conf file on a shared filesystem. This works great.
>> Anytime there are changes, a simple 'scontrol reconfigure' brings all
>> running nodes up to speed and any down nodes will automatically read
>> the latest.
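>>
>> As a rough sketch of that setup (the paths are only examples and will
>> differ per site):
>>
>>    # on each node, point the usual location at the shared copy
>>    ln -sf /shared/slurm/etc/slurm.conf /etc/slurm/slurm.conf
>>    # after editing the shared copy, have all running daemons re-read it
>>    scontrol reconfigure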
>>
>> Brian Andrus
>>
>> On 2/23/2022 2:31 AM, David Simpson wrote:
>>
>> Hi all,
>>
>> Interested to know what common approaches are to:
>>
>>
>> 1. Monitoring of power saving nodes (e.g. health of the node), when
>>    potentially the monitoring system will see them go up and down. Do
>>    you limit yourselves to BMC-only monitoring/health?
>> 2. When you want to make changes to slurm.conf (or anything else) on
>>    a node which is down due to power saving (during a
>>    maintenance/reservation), what is your approach? Do you end up
>>    with two slurm.confs (one for power saving and one that keeps
>>    everything up, to work on during the maintenance)?
>>
>>
>> thanks
>> David
>>
>>
>> -------------
>>
>> David Simpson - Senior Systems Engineer
>>
>> ARCCA, Redwood Building,
>>
>> King Edward VII Avenue,
>>
>> Cardiff, CF10 3NB
>>
>> David Simpson - peiriannydd uwch systemau
>>
>> ARCCA, Adeilad Redwood,
>>
>> King Edward VII Avenue,
>>
>> Caerdydd, CF10 3NB
>>
>> simpsond4 at cardiff.ac.uk <mailto:simpsond4 at cardiff.ac.uk>
>>
>> +44 29208 74657
>>
>