Hello,
I have a node in “drain” state after a job that was running on it finished. The slurmd log on the node reports the following:
[...]
[2025-09-07T11:09:26.980] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 59238
[2025-09-07T11:09:26.980] task/affinity: batch_bind: job 59238 CPU input mask for node: 0xFFF
[2025-09-07T11:09:26.980] task/affinity: batch_bind: job 59238 CPU final HW mask for node: 0xFFF
[2025-09-07T11:09:26.980] Launching batch job 59238 for UID 21310
[2025-09-07T11:09:27.006] cred/munge: init: Munge credential signature plugin loaded
[2025-09-07T11:09:27.007] [59238.batch] debug: auth/munge: init: loaded
[2025-09-07T11:09:27.009] [59238.batch] debug: Reading cgroup.conf file /soft/slurm-23.11.0/etc/cgroup.conf
[2025-09-07T11:09:27.025] [59238.batch] debug: cgroup/v1: init: Cgroup v1 plugin loaded
[2025-09-07T11:09:27.025] [59238.batch] debug: hash/k12: init: init: KangarooTwelve hash plugin loaded
[2025-09-07T11:09:27.026] [59238.batch] debug: task/cgroup: init: core enforcement enabled
[2025-09-07T11:09:27.026] [59238.batch] debug: task/cgroup: init: device enforcement enabled
[2025-09-07T11:09:27.026] [59238.batch] debug: task/cgroup: init: Tasks containment cgroup plugin loaded
[2025-09-07T11:09:27.026] [59238.batch] task/affinity: init: task affinity plugin loaded with CPU mask 0xfff
[2025-09-07T11:09:27.027] [59238.batch] debug: jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2025-09-07T11:09:27.027] [59238.batch] topology/default: init: topology Default plugin loaded
[2025-09-07T11:09:27.030] [59238.batch] debug: gpu/generic: init: init: GPU Generic plugin loaded
[2025-09-07T11:09:27.031] [59238.batch] debug: laying out the 12 tasks on 1 hosts clus09 dist 2
[2025-09-07T11:09:27.031] [59238.batch] debug: close_slurmd_conn: sending 0: No error
[2025-09-07T11:09:27.031] [59238.batch] debug: Message thread started pid = 910040
[2025-09-07T11:09:27.031] [59238.batch] debug: Setting slurmstepd(910040) oom_score_adj to -1000
[2025-09-07T11:09:27.031] [59238.batch] debug: spank: opening plugin stack /soft/slurm-23.11.0/etc/plugstack.conf
[2025-09-07T11:09:27.031] [59238.batch] debug: task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-11'
[2025-09-07T11:09:27.031] [59238.batch] debug: task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-11'
[2025-09-07T11:09:27.031] [59238.batch] debug: task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-11'
[2025-09-07T11:09:27.031] [59238.batch] debug: task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0-11'
[2025-09-07T11:09:27.090] [59238.batch] debug levels are stderr='error', logfile='debug', syslog='fatal'
[2025-09-07T11:09:27.090] [59238.batch] starting 1 tasks
[2025-09-07T11:09:27.090] [59238.batch] task 0 (910044) started 2025-09-07T11:09:27
[2025-09-07T11:09:27.098] [59238.batch] debug: task/affinity: task_p_pre_launch: affinity StepId=59238.batch, task:0 bind:mask_cpu
[2025-09-07T11:09:27.098] [59238.batch] _set_limit: RLIMIT_NPROC : reducing req:255366 to max:159631
[2025-09-07T11:09:27.398] [59238.batch] task 0 (910044) exited with exit code 2.
[2025-09-07T11:09:27.399] [59238.batch] debug: task/affinity: task_p_post_term: affinity StepId=59238.batch, task 0
[2025-09-07T11:09:27.399] [59238.batch] debug: signaling condition
[2025-09-07T11:09:27.399] [59238.batch] debug: jobacct_gather/cgroup: fini: Job accounting gather cgroup plugin unloaded
[2025-09-07T11:09:27.400] [59238.batch] debug: task/cgroup: fini: Tasks containment cgroup plugin unloaded
[2025-09-07T11:09:27.400] [59238.batch] job 59238 completed with slurm_rc = 0, job_rc = 512
[2025-09-07T11:09:27.410] [59238.batch] debug: Message thread exited
[2025-09-07T11:09:27.410] [59238.batch] stepd_cleanup: done with step (rc[0x200]:Unknown error 512, cleanup_rc[0x0]:No error)
[2025-09-07T11:09:27.411] debug: _rpc_terminate_job: uid = 1000 JobId=59238
[2025-09-07T11:09:27.411] debug: credential for job 59238 revoked
[...]
“sinfo” shows:
[root@login-node ~]# sinfo
PARTITION TIMELIMIT AVAIL STATE NODELIST CPU_LOAD NODES(A/I) NODES(A/I/O/T) CPUS CPUS(A/I/O/T) REASON
node.q* 4:00:00 up drained clus09 0.00 0/0 0/0/1/1 12 0/0/12/12 Kill task failed
node.q* 4:00:00 up allocated clus[10-11] 13.82-15.8 2/0 2/0/0/2 12 24/0/0/24 none
node.q* 4:00:00 up idle clus[01-06,12] 0.00 0/7 0/7/0/7 12 0/84/0/84 none
But there seems to be no error on the node... The slurmctld.log on the server looks correct, too.
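(For reference, I assume the node’s full status and the Reason field can be checked with something like the command below; “clus09” is the drained node from the sinfo output above.)

scontrol show node clus09    # shows State, Reason, and when/by whom the drain was set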
Is there any way to reset the node to “state=idle” after errors like this?
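I was thinking of something like the following to clear the drain manually (assuming “scontrol update” with State=resume is the right approach; “clus09” is the affected node):

scontrol update NodeName=clus09 State=resume    # clears the drain so the node can return to idle

but I am not sure whether it is safe to resume the node without first understanding why it was drained with “Kill task failed”.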
Thanks.