Re: Node in drain state

Gestió Servidors

18 Sep 2025

3:11 a.m.

Hi,

After reading the answer from Ole Holm Nielsen, I have increased "MessageTimeout" to 20s (the default is 5s) and "UnkillableStepTimeout" to 150s (the default is 60s, and it should always be 5 times larger than "MessageTimeout"). However, I have also read that "UnkillableStepProgram" specifies the program to run in those cases, but by default no program is assigned to that parameter (nothing is run). So my question is whether anyone uses a customized "UnkillableStepProgram" and, if so, whether he/she could explain what it does.
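For reference, the two values mentioned above, as they would appear in slurm.conf, together with a small check of the "5 times larger" relationship. This is only a minimal sketch: the function name `check_unkillable_timeout` is made up for illustration, and the factor of 5 is simply the rule quoted in this message, not something the snippet reads from Slurm.

```shell
#!/usr/bin/env bash
# Values from this message, as they would appear in slurm.conf:
#   MessageTimeout=20
#   UnkillableStepTimeout=150

# Hypothetical helper: succeeds if UnkillableStepTimeout is at least
# 5 x MessageTimeout (the relationship quoted in the message).
check_unkillable_timeout() {
    local message_timeout=$1 unkillable_timeout=$2
    local minimum=$(( message_timeout * 5 ))
    if (( unkillable_timeout >= minimum )); then
        echo "ok: ${unkillable_timeout}s >= ${minimum}s"
    else
        echo "too low: ${unkillable_timeout}s < ${minimum}s"
        return 1
    fi
}

check_unkillable_timeout 20 150   # prints "ok: 150s >= 100s"
```

On a running cluster, the values actually in effect can be listed with `scontrol show config | grep -E 'MessageTimeout|UnkillableStepTimeout'`.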

Thanks a lot!


Lorenzo Bosio

18 Sep

3:39 a.m.

New subject: Node in drain state

Hello,

As an example, my UnkillableStepProgram is just a bash script that collects recent logs and process listings and mails me about the error. Nothing special.
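A script along those lines might look like the sketch below. It is not the script described above, only an illustration under stated assumptions: the log path, the recipient address, and the presence of a `mail` command on the compute node are all assumed, and the function names are made up. As far as the slurm.conf documentation describes, slurmd passes the job and step to the program through the `SLURM_JOB_ID` and `SLURM_STEP_ID` environment variables.

```shell
#!/usr/bin/env bash
# Sketch of an UnkillableStepProgram: gather context and mail it.
# Assumed (not from the thread): log path, recipient, `mail` command.

SLURMD_LOG="${SLURMD_LOG:-/var/log/slurm/slurmd.log}"   # assumed location
ALERT_MAILTO="${ALERT_MAILTO:-root}"                    # assumed recipient

# Print a plain-text report for the given job ID and step ID.
build_report() {
    local job_id=$1 step_id=$2
    echo "Unkillable step on $(uname -n): job ${job_id}, step ${step_id}"
    echo
    echo "== Processes in uninterruptible sleep (state D) =="
    # Keep the header line plus any process whose STAT starts with D.
    ps -eo pid,stat,user,wchan:32,args 2>/dev/null | awk 'NR == 1 || $2 ~ /^D/'
    echo
    echo "== Last 50 lines of ${SLURMD_LOG} =="
    tail -n 50 "$SLURMD_LOG" 2>/dev/null || echo "(log not readable)"
}

# Mail the report to the configured recipient.
send_report() {
    build_report "$1" "$2" | mail -s "Unkillable step on $(uname -n)" "$ALERT_MAILTO"
}

# Act only when slurmd has supplied a job ID; otherwise do nothing.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    send_report "$SLURM_JOB_ID" "${SLURM_STEP_ID:-unknown}"
fi
```

It would be enabled with a line such as `UnkillableStepProgram=/usr/local/sbin/unkillable_report.sh` in slurm.conf (the path is hypothetical). Listing processes in state D is a deliberate choice here, since processes stuck on I/O are a frequent reason why a step cannot be killed.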

Best regards, -- *Lorenzo Bosio* Tecnico di Ricerca - Laboratorio HPC4AI Dipartimento di Informatica

Università degli Studi di Torino Corso Svizzera, 185 - 10149 Torino tel. +39 340 216 8249 tel. +39 011 670 6836


Ole Holm Nielsen

19 Sep

12:01 a.m.

New subject: Node in drain state

On 9/18/25 12:39, Lorenzo Bosio via slurm-users wrote:

as an example, my UnkillableStepProgram is just a bash script collecting recent logs and processes and mailing me about the error. Nothing special.

We use Slurm "triggers" to get alerts from many different types of events, see https://github.com/OleHolmNielsen/Slurm_tools/tree/master/triggers

Relevant here is the "notify_nodes_drained" trigger script, which handles the node drained state.
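For readers who have not used triggers, a handler for drained nodes can be sketched roughly as follows. This is not the notify_nodes_drained script from the repository linked above; the recipient address, the `mail` command, the install path, and the function names are assumptions made for illustration. The `strigger` options shown (`--set`, `--node`, `--drained`, `--flags=PERM`, `--program`) are standard ones, and strigger passes the affected node names to the program as its argument.

```shell
#!/usr/bin/env bash
# Sketch of a "node drained" trigger handler (illustrative only).
# Assumed: recipient address, `mail` command, and the install path below.
#
# One-time registration, normally run as SlurmUser:
#   strigger --set --node --drained --flags=PERM \
#            --program=/usr/local/sbin/notify_drained.sh

ALERT_MAILTO="${ALERT_MAILTO:-root}"   # assumed recipient

# Build the mail subject from the node list.
drained_subject() {
    echo "Slurm nodes drained: $*"
}

# Mail the drain reasons that Slurm has recorded for the given nodes.
notify_drained() {
    local nodes=$1
    sinfo -R --nodes="$nodes" 2>&1 | mail -s "$(drained_subject "$nodes")" "$ALERT_MAILTO"
}

# Act only when a node list was passed in, as strigger does.
if [ "$#" -gt 0 ]; then
    notify_drained "$1"
fi
```

`--flags=PERM` keeps the trigger registered after it fires; without it the trigger would have to be set again after each event.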

We don't use an UnkillableStepProgram. In my experience the *Kill task failed* events discussed earlier in this thread require a manual examination of why the job failed to die, and I think it will be hard to write a script to examine all kinds of possible errors.

The most common scenario is stale I/O from the job to a network file server, and I described in a previous post how we deal with this.

BTW, we use this setting: UnkillableStepTimeout=180 (seconds)


IHTH, Ole
