[slurm-users] Nodes not returning from DRAINING

Diego Zuccato diego.zuccato at unibo.it
Wed Oct 28 11:57:52 UTC 2020


Hello all.

I've found that some jobs sometimes leave their nodes in the DRAINING state.

In slurmctld.log I find:
-8<--
[2020-10-28T11:30:16.999] update_node: node str957-mtx-11 reason set to:
Kill task failed
[2020-10-28T11:30:16.999] update_node: node str957-mtx-11 state set to
DRAINING
-8<--
while on the node (slurmd.log):
-8<--
[2020-10-28T11:24:11.980] [8975.0] task/cgroup:
/slurm_str957-mtx-11/uid_2126297435/job_8975: alloc=117600MB
mem.limit=117600MB memsw.limit=117600MB
[2020-10-28T11:24:11.980] [8975.0] task/cgroup:
/slurm_str957-mtx-11/uid_2126297435/job_8975/step_0: alloc=117600MB
mem.limit=117600MB memsw.limit=117600MB
[2020-10-28T11:29:18.926] [8975.0] Defering sending signal, processes in
job are currently core dumping
[2020-10-28T11:30:17.000] [8975.0] error: *** STEP 8975.0 STEPD
TERMINATED ON str957-mtx-11 AT 2020-10-28T11:30:16 DUE TO JOB NOT ENDING
WITH SIGNALS ***
[2020-10-28T11:30:19.306] [8975.0] done with job
-8<--

It seems slurmd waits about a minute (from the "Defering sending signal"
message at 11:29:18 to the forced termination at 11:30:17), then gives up
and drains the node. Is there some timeout I could raise to avoid having
to fix it manually?
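If I understand the docs right, the knob involved might be
UnkillableStepTimeout in slurm.conf (default 60 seconds, which would
match the one-minute gap above). A minimal sketch, assuming that is
indeed the timeout behind "Kill task failed":
-8<--
# slurm.conf -- same file on the controller and the compute nodes,
# followed by an scontrol reconfigure (or a daemon restart).
# Give slurmd more time to reap processes that are stuck core dumping
# before it declares "Kill task failed" and drains the node.
# 180 is just an example value, not a recommendation.
UnkillableStepTimeout=180
-8<--
And for completeness, by "fix it manually" I mean resuming the drained
node by hand, e.g.:
-8<--
scontrol update NodeName=str957-mtx-11 State=RESUME
-8<--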

TIA.

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
