[slurm-users] Nodes not returning from DRAINING

Wed Oct 28 12:11:01 UTC 2020

You may try this workaround:

scontrol update NodeName=<node name> State=IDLE
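
If it helps, the same workaround can be wrapped in a small script that first shows why the node was drained. This is only a sketch under some assumptions: resume_node and run are hypothetical helper names (not part of Slurm), the script defaults to a dry run that prints the commands instead of executing them, and it uses State=RESUME in place of State=IDLE, since RESUME is the scontrol state intended for clearing a drain.

```shell
#!/bin/sh
# Sketch only: run and resume_node are hypothetical helpers, not Slurm commands.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

resume_node() {
    node="$1"
    # Show the node record, including the drain reason (e.g. "Kill task failed").
    run scontrol show node "$node"
    # Clear the drain state so the node can accept jobs again.
    run scontrol update NodeName="$node" State=RESUME
}
```

For example, "resume_node str957-mtx-11" prints the two scontrol commands; setting DRY_RUN=0 on a host with admin rights would actually run them. On the timeout question, the "Kill task failed" reason may be related to the UnkillableStepTimeout setting in slurm.conf (default 60 seconds, which roughly matches the gap in the quoted slurmd.log); the current value can be checked with "scontrol show config | grep -i UnkillableStepTimeout".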

Thanks & Regards,
Sudeep Narayan Banerjee
System Analyst | Scientist B
Information System and Technology Facility
Indian Institute of Technology Gandhinagar
Palaj, Gujarat 382355, INDIA

On Wed, Oct 28, 2020 at 5:41 PM Diego Zuccato <diego.zuccato at unibo.it>
wrote:

> Hello all.
>
> I've found that sometimes, some jobs leave the nodes in DRAINING state.
>
> In slurmctld.log I find:
> -8<--
> [2020-10-28T11:30:16.999] update_node: node str957-mtx-11 reason set to:
> Kill task failed
> [2020-10-28T11:30:16.999] update_node: node str957-mtx-11 state set to
> DRAINING
> -8<--
> while on the node (slurmd.log):
> -8<--
> [2020-10-28T11:24:11.980] [8975.0] task/cgroup:
> /slurm_str957-mtx-11/uid_2126297435/job_8975: alloc=117600MB
> mem.limit=117600MB memsw.limit=117600MB
> [2020-10-28T11:24:11.980] [8975.0] task/cgroup:
> /slurm_str957-mtx-11/uid_2126297435/job_8975/step_0: alloc=117600MB
> mem.limit=117600MB memsw.limit=117600MB
> [2020-10-28T11:29:18.926] [8975.0] Defering sending signal, processes in
> job are currently core dumping
> [2020-10-28T11:30:17.000] [8975.0] error: *** STEP 8975.0 STEPD
> TERMINATED ON str957-mtx-11 AT 2020-10-28T11:30:16 DUE TO JOB NOT ENDING
> WITH SIGNALS ***
> [2020-10-28T11:30:19.306] [8975.0] done with job
> -8<--
>
> It seems slurmd takes a bit too long to close the job. Is there a
> timeout I could change to avoid having to fix this manually?
>
> TIA.
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
>
>