Hi,
After reading Ole Holm Nielsen's answer, I have increased "MessageTimeout" to 20s (the default is 5s) and "UnkillableStepTimeout" to 150s (the default is 60s, and it should always be at least 5 times larger than "MessageTimeout"). However, I have also read that "UnkillableStepProgram" specifies the program to run in those cases, and by default no program is assigned to that parameter. So my question is: does anyone use a custom "UnkillableStepProgram", and if so, could you explain what it does?
Thanks a lot!
Hello,
As an example, my UnkillableStepProgram is just a bash script that collects recent logs and running processes and mails me about the error. Nothing special.
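This is not the script referred to above, only a minimal sketch of what such a program could look like. ADMIN_EMAIL, the SLURMD_LOG path, and the availability of a "mail" command are assumptions, and so is the idea that SLURM_JOB_ID and SLURM_STEP_ID are set in the program's environment (check the slurm.conf man page for your version).

```shell
#!/bin/bash
# Hypothetical UnkillableStepProgram sketch: gather context, then mail it.
# ADMIN_EMAIL and SLURMD_LOG are assumed names with assumed default values.
ADMIN_EMAIL="${ADMIN_EMAIL:-root}"
SLURMD_LOG="${SLURMD_LOG:-/var/log/slurm/slurmd.log}"

build_report() {
    echo "Unkillable step on $(uname -n) at $(date)"
    # Assumption: Slurm exports these variables; fall back to "unknown".
    echo "Job: ${SLURM_JOB_ID:-unknown}  Step: ${SLURM_STEP_ID:-unknown}"
    echo
    echo "== Processes in uninterruptible sleep (state D) =="
    # Keep the header line plus every process whose STAT column starts with D,
    # which usually means it is blocked on I/O.
    ps -eo pid,user,stat,wchan:32,cmd 2>/dev/null | awk 'NR == 1 || $3 ~ /^D/'
    echo
    echo "== Last lines of ${SLURMD_LOG} =="
    if [ -r "$SLURMD_LOG" ]; then
        tail -n 50 "$SLURMD_LOG"
    else
        echo "(log not readable)"
    fi
}

send_report() {
    # Mail the report read from stdin if "mail" exists; otherwise print it.
    if command -v mail >/dev/null 2>&1; then
        mail -s "Unkillable step on $(uname -n)" "$ADMIN_EMAIL"
    else
        cat
    fi
}

build_report | send_report
```

To try it, one would point UnkillableStepProgram in slurm.conf at the script's location on the compute nodes and reconfigure. Keeping the report and the delivery in separate functions makes it easy to swap mail for another alerting channel.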
Best regards, -- *Lorenzo Bosio* Tecnico di Ricerca - Laboratorio HPC4AI Dipartimento di Informatica
Università degli Studi di Torino Corso Svizzera, 185 - 10149 Torino tel. +39 340 216 8249 tel. +39 011 670 6836
(In reply to the question posted by Gestió Servidors via slurm-users <slurm-users@lists.schedmd.com> on Thu, 18 Sep 2025 at 12:22.)
On 9/18/25 12:39, Lorenzo Bosio via slurm-users wrote:
as an example, my UnkillableStepProgram is just a bash script collecting recent logs and processes and mailing me about the error. Nothing special.
We use Slurm "triggers" to get alerts for many different types of events; see https://github.com/OleHolmNielsen/Slurm_tools/tree/master/triggers
Relevant here is the "notify_nodes_drained" trigger script, which handles the node drained state.
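For readers who have not used triggers, here is a rough sketch of how a drained-node trigger might be registered with strigger. The TRIGGER_PROGRAM path is a hypothetical install location, not taken from the repository above, and the helper only prints the command when strigger is not installed.

```shell
#!/bin/bash
# Sketch: register a permanent trigger that fires when a node becomes DRAINED.
# TRIGGER_PROGRAM is a hypothetical path; adjust it to where the script lives.
TRIGGER_PROGRAM="${TRIGGER_PROGRAM:-/usr/local/sbin/notify_nodes_drained}"

trigger_command() {
    # Print the strigger invocation so it can be reviewed before it is run.
    printf 'strigger --set --node --drained --flags=PERM --program=%s\n' "$1"
}

cmd="$(trigger_command "$TRIGGER_PROGRAM")"
if command -v strigger >/dev/null 2>&1; then
    # Triggers can normally only be set by SlurmUser.
    $cmd
else
    echo "strigger not found; would run: $cmd"
fi
```

Existing triggers can be listed afterwards with "strigger --get". The PERM flag keeps the trigger in place after it fires, so it does not have to be set again after every event.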
We don't use an UnkillableStepProgram. In my experience the *Kill task failed* events discussed earlier in this thread require a manual examination of why the job failed to die, and I think it will be hard to write a script to examine all kinds of possible errors.
The most common scenario is stale I/O from the job to a network file server, and I described in a previous post how we deal with this.
BTW we use this parameter: UnkillableStepTimeout = 180 sec
IHTH, Ole