[slurm-users] Slurm - UnkillableStepProgram
Stefan Staeglich
staeglis at informatik.uni-freiburg.de
Thu Jan 19 13:01:26 UTC 2023
Hi,
I'm wondering where the UnkillableStepProgram is actually executed. According
to Chris it has to be available on every compute node. This only makes sense
if it is executed there.
But the slurm.conf man page for 21.08.x states:
    UnkillableStepProgram
        Must be executable by user SlurmUser. The file must be
        accessible by the primary and backup control machines.
So I would expect it to be executed on the controller node.
Best,
Stefan
On Tuesday, 23 March 2021, 05:30:01 CET, Chris Samuel wrote:
> Hi Mike,
>
> On 22/3/21 7:12 pm, Yap, Mike wrote:
> > # I presume UnkillableStepTimeout is set in slurm.conf and acts as a
> > timer to trigger UnkillableStepProgram.
>
> That is correct.
>
> > # UnkillableStepProgram can be used to send email or reboot a compute
> > node; the question is, how do we configure it?
>
> It can also be used to automate collecting debug info (which is what we
> do); we then manually intervene to reboot the node once we've determined
> there's no more useful info to collect.
>
> It's just configured in your slurm.conf.
>
> UnkillableStepProgram=/path/to/the/unkillable/step/script.sh
>
> Of course this script has to be present on every compute node.
>
> All the best,
> Chris
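A minimal slurm.conf sketch setting the two parameters from this thread together. UnkillableStepTimeout is given in seconds and acts as the timer before UnkillableStepProgram is run; the value 120 is an illustrative assumption, not a recommendation, and the script path is the same placeholder Chris uses.

```
# slurm.conf (excerpt); values are illustrative only
# Seconds to wait after a step's processes have been sent SIGKILL
# before treating them as unkillable and running UnkillableStepProgram
UnkillableStepTimeout=120
# Program to run once that timeout expires (placeholder path)
UnkillableStepProgram=/path/to/the/unkillable/step/script.sh
```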
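Chris mentions using the program to automate collecting debug info, but the thread does not show their script. As a purely hypothetical sketch, the Python below scans /proc on Linux for processes in uninterruptible sleep (state D), a common reason a SIGKILL has no effect, and prints each one's PID, command name and kernel wait channel. All function names are illustrative assumptions. Because it only inspects the machine it runs on, it is only useful if the program is executed on the affected compute node, which is exactly the point in question above.

```python
#!/usr/bin/env python3
"""Hypothetical debug-collection sketch for an UnkillableStepProgram.

Assumes a Linux /proc filesystem; not taken from the thread or from Slurm.
"""
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line.

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the last ')' instead of on whitespace.
    """
    head, _, tail = stat_line.rpartition(")")
    pid_str, _, comm = head.partition(" (")
    state = tail.split()[0]
    return int(pid_str), comm, state


def read_text(path):
    """Read a small /proc file, returning '' if it vanished or is unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""


def uninterruptible_processes(proc_root="/proc"):
    """Yield (pid, comm, wchan) for every process currently in state 'D'."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return  # no /proc here (e.g. not Linux): nothing to report
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-PID entries such as 'self' or 'meminfo'
        stat = read_text(os.path.join(proc_root, entry, "stat"))
        if not stat:
            continue  # process exited between listdir() and open()
        pid, comm, state = parse_stat(stat)
        if state == "D":
            wchan = read_text(os.path.join(proc_root, entry, "wchan"))
            yield pid, comm, wchan or "?"


def main():
    # Print one tab-separated line per stuck process for later inspection.
    for pid, comm, wchan in uninterruptible_processes():
        print(f"{pid}\t{comm}\t{wchan}")


if __name__ == "__main__":
    main()
```

Parsing /proc directly rather than calling ps keeps the sketch dependency-free; a real script would likely also record timestamps and node names and write to a log rather than stdout.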
--
Stefan Stäglich, Universität Freiburg, Institut für Informatik
Georges-Köhler-Allee, Geb.52, 79110 Freiburg, Germany
E-Mail : staeglis at informatik.uni-freiburg.de
WWW : ml.informatik.uni-freiburg.de
Phone: +49 761 203-8223