[slurm-users] SLURM: reconfig

Brian Andrus toomuchit at gmail.com
Thu May 5 13:48:28 UTC 2022


@Tina,

Figure slurmd reads the config once at startup and runs with it. You would 
need to have it recheck regularly to see if there are any changes. This is 
exactly what 'scontrol reconfig' does: it tells all the Slurm daemons to 
re-read the config.
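
For example, a config push could end with a single reconfigure call. Just a 
sketch -- the paths are made up for illustration:

    # Assumed workflow: render the new slurm.conf on the controller, then
    # ask every slurmctld/slurmd to re-read it. The source path is hypothetical.
    import shutil
    import subprocess

    shutil.copy("/srv/config/slurm.conf.new", "/etc/slurm/slurm.conf")
    subprocess.run(["scontrol", "reconfigure"], check=True)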


@Steven,

It seems to me you could just have a monitor daemon that keeps things 
up-to-date.
It could watch for the alert that AWS sends (the 2-minute warning, IIRC) and 
take appropriate action: drain the node and cancel/checkpoint the job.
In addition, it could keep an eye on things in case a warning 
wasn't received and a node simply 'vanishes'. I suspect Nagios even has the 
hooks to make that work. You could also email the user to let them know 
their job was ended because the spot instance was reclaimed.
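
As a rough sketch of what such a monitor could look like -- assuming (on my 
part) that it runs on the spot node itself, that IMDSv1 is reachable, and 
that the Slurm NodeName matches the node's short hostname:

    #!/usr/bin/env python3
    # Sketch only, not production code: poll the EC2 instance metadata service
    # for a spot interruption notice, then drain the node and cancel its jobs.
    import socket
    import subprocess
    import time
    import urllib.error
    import urllib.request

    # This endpoint returns 404 until AWS schedules the interruption.
    IMDS_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"
    NODE = socket.gethostname().split(".")[0]   # assumes NodeName == short hostname

    def spot_interruption_pending() -> bool:
        try:
            with urllib.request.urlopen(IMDS_URL, timeout=2) as resp:
                return resp.status == 200
        except urllib.error.URLError:
            return False

    def drain_and_cancel() -> None:
        # Keep new work off the node, then clear whatever is still running.
        subprocess.run(["scontrol", "update", f"NodeName={NODE}",
                        "State=DRAIN", "Reason=aws_spot_interruption"], check=False)
        subprocess.run(["scancel", f"--nodelist={NODE}"], check=False)

    if __name__ == "__main__":
        while True:
            if spot_interruption_pending():
                drain_and_cancel()
                break
            time.sleep(5)   # the 2-minute warning leaves room for a short poll

Run it from a systemd unit or cron on each spot node; the same loop is the 
obvious place to hook in the "email the user" step.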

Just some ideas,

Brian Andrus

On 5/5/2022 6:28 AM, Steven Varga wrote:
> Hi Tina,
> Thank you for sharing. This matches my observations when I checked whether 
> Slurm could do what I am up to: managing AWS EC2 dynamic (spot) instances.
>
> After replacing MySQL with Redis, I now wonder what it would take to 
> make Slurm node addition/removal dynamic. I've been looking at the 
> source code for many months now, trying to decide if it can be done.
>
> I am using configless mode with 3 controllers and 2 slurmdbd instances, 
> backed by a Redis Sentinel setup for robustness.
>
> Steven
>
>
> On Thu., May 5, 2022, 08:57 Tina Friedrich, 
> <tina.friedrich at it.ox.ac.uk> wrote:
>
>     Hi List,
>
>     out of curiosity - I would assume that, when running configless, one
>     doesn't need to manually restart slurmd on the nodes when the config
>     changes?
>
>     Hi Steven,
>
>     I have no idea if you want to do it every couple of minutes and what
>     the implications of that are (although I've certainly managed to
>     restart them every 5 minutes by accident with no real problems
>     caused), but - generally, restarting the daemons (slurmctld, slurmd)
>     is a non-issue, as it's a safe operation. There's no risk to running
>     jobs or anything. I have the config management restart them if any
>     files change. It also doesn't seem to matter if the restarts of the
>     controller & the node daemons are splayed a bit (i.e. don't happen at
>     the same time), or what order they happen in.
>
>     Tina
>
>     On 05/05/2022 13:17, Steven Varga wrote:
>     > Thank you for the quick reply! I know I am pushing my luck here: is
>     > it possible to modify Slurm (src/common/[read_conf.c, node_conf.c],
>     > src/slurmctld/[read_config.c, ...]) such that the state can be
>     > maintained dynamically? -- or is it cheaper to write a job manager
>     > with fewer features but supporting dynamic nodes from the ground up?
>     > best wishes: steve
>     >
>     > On Thu, May 5, 2022 at 12:29 AM Christopher Samuel
>     > <chris at csamuel.org> wrote:
>     >
>     >     On 5/4/22 7:26 pm, Steven Varga wrote:
>     >
>     >      > I am wondering what is the best way to update node changes,
>     >      > such as addition and removal of nodes to SLURM. The excerpts
>     >      > below suggest a full restart, can someone confirm this?
>     >
>     >     You are correct, you need to restart slurmctld and slurmd
>     >     daemons at present.  See https://slurm.schedmd.com/faq.html#add_nodes
>     >
>     >     All the best,
>     >     Chris
>     >     --
>     >     Chris Samuel  : http://www.csamuel.org/
>     >     :  Berkeley, CA, USA
>     >
>
>     -- 
>     Tina Friedrich, Advanced Research Computing Snr HPC Systems
>     Administrator
>
>     Research Computing and Support Services
>     IT Services, University of Oxford
>     http://www.arc.ox.ac.uk http://www.it.ox.ac.uk
>