[slurm-users] Slurm 17.11.2: defunct slurmd process leaves a sleep in the step_extern cgroup
Alessandro Federico
a.federico at cineca.it
Tue Jan 30 10:01:59 MST 2018
Hi all,
We observe a lot of jobs that stay in the completing state until we kill the sleep process inside the step_extern cgroup.
In these cases, what we see on the affected nodes is a defunct slurmd:
[root at r113c18s01 ~]# ps --forest -lfe | egrep '[s]leep|[s]lurm'
1 S root 26867 1 0 80 0 - 891256 inet_c Jan23 ? 00:03:48 /usr/sbin/slurmd
1 Z root 25518 26867 0 80 0 - 0 exit 12:59 ? 00:00:00 \_ [slurmd] <defunct>
0 S root 25525 1 0 80 0 - 26974 hrtime 12:59 ? 00:00:00 sleep 1000000
[root at r113c18s01 ~]# cat /sys/fs/cgroup/cpuset/slurm/uid_29592/job_62379/step_extern/tasks
25525
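For reference, the manual workaround described above (finding and killing whatever is left in the step_extern cgroup) can be sketched roughly as follows. This is only an illustration: the helper names list_extern_pids and kill_extern_pids are made up, and the default CGROUP_ROOT simply mirrors the cpuset path shown above, which may differ on other sites.

```shell
#!/bin/sh
# Assumption: cgroup v1 cpuset hierarchy as seen on the node above.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup/cpuset/slurm}"

# list_extern_pids UID JOBID (hypothetical helper)
# Print every PID still attached to the step_extern cgroup of the job.
list_extern_pids() {
    tasks="$CGROUP_ROOT/uid_$1/job_$2/step_extern/tasks"
    # A missing tasks file means the cgroup is already gone.
    [ -r "$tasks" ] || return 0
    cat "$tasks"
}

# kill_extern_pids UID JOBID (hypothetical helper)
# Send SIGKILL to every PID reported by list_extern_pids.
kill_extern_pids() {
    for pid in $(list_extern_pids "$1" "$2"); do
        kill -KILL "$pid" 2>/dev/null || echo "could not kill $pid" >&2
    done
}
```

With the values from this node, that would be `list_extern_pids 29592 62379` to inspect, then `kill_extern_pids 29592 62379` as root to release the job.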
We see from the UNIX process accounting logs that the step_extern slurmstepd died immediately:
[root at r113c18s01 ~]# lastcomm --command slurmstepd | grep D
slurmstepd DX root __ 0.89 secs Tue Jan 30 12:59
[root at r113c18s01 ~]# dump-acct /var/account/pacct | grep 'Tue Jan 30 12:59' | grep slurm
slurmd |v3| 0.00| 0.00| 0.00| 0| 0|3565056.00| 0.00| 25518 26867|Tue Jan 30 12:59:48 2018
slurmstepd |v3| 31.00| 58.00| 93.00| 0| 0|199680.00| 0.00| 25519 1|Tue Jan 30 12:59:49 2018
So both the sleep and slurmstepd processes turn out to be children of systemd (PID 1).
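A rough way to look for the same pattern on other nodes is to filter ps output for zombie slurmd children and for sleep processes whose parent is PID 1. This is a sketch only: the function name filter_suspects is made up for illustration, and it assumes input columns in the order PID, PPID, STAT, COMM.

```shell
# filter_suspects (hypothetical helper): read "PID PPID STAT COMM" lines on
# stdin and flag the two symptoms described in this report.
filter_suspects() {
    awk '
        # Zombie child of slurmd: state starts with Z, command is slurmd.
        $3 ~ /^Z/ && $4 == "slurmd" { print "defunct-slurmd", $1, $2 }
        # sleep process whose parent is PID 1.
        $4 == "sleep" && $2 == 1     { print "orphan-sleep", $1, $2 }
    '
}
```

On a live node it could be fed with `ps -eo pid=,ppid=,stat=,comm= | filter_suspects`. A sleep owned by PID 1 is not necessarily a step_extern leftover, so any hit would still need to be checked against the cgroup tasks file.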
slurmd reports:
[root at r113c18s01 ~]# journalctl -u slurmd | grep 62379
Jan 30 12:59:48 r113c18s01 slurmd[26867]: task_p_slurmd_batch_request: 62379
Jan 30 12:59:48 r113c18s01 slurmd[26867]: task/affinity: job 62379 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
Jan 30 12:59:48 r113c18s01 slurmd[26867]: task/affinity: job 62379 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
Jan 30 12:59:48 r113c18s01 slurmd[26867]: debug: Waiting for job 62379's prolog to complete
Jan 30 12:59:48 r113c18s01 slurmd[26867]: debug: [job 62379] attempting to run prolog [/etc/slurm/prolog.d/create_local_tmpdir.sh]
Jan 30 12:59:48 r113c18s01 slurmd[26867]: _run_prolog: prolog with lock for job 62379 ran for 0 seconds
Jan 30 12:59:49 r113c18s01 slurmd[26867]: debug: _step_connect: connect() failed dir /var/spool/slurmd node r113c18s01 step 62379.4294967295 Connection refused
Jan 30 15:11:24 r113c18s01 slurmd[26867]: debug: _step_connect: connect() failed dir /var/spool/slurmd node r113c18s01 step 62379.4294967295 Connection refused
Jan 30 15:11:24 r113c18s01 slurmd[26867]: debug: Cleaned up stray socket /var/spool/slurmd/r113c18s01_62379.4294967295
Jan 30 17:00:13 r113c18s01 slurmd[26867]: Job 62379: timeout: sent SIGTERM to 0 active steps
Jan 30 17:00:13 r113c18s01 slurmd[26867]: debug: task_p_slurmd_release_resources: affinity jobid 62379
Jan 30 17:00:13 r113c18s01 slurmd[26867]: debug: credential for job 62379 revoked
Jan 30 17:00:13 r113c18s01 slurmd[26867]: debug: Waiting for job 62379's prolog to complete
Jan 30 17:04:33 r113c18s01 slurmd[26867]: debug: task_p_slurmd_release_resources: affinity jobid 62379
Jan 30 17:08:39 r113c18s01 slurmd[26867]: debug: task_p_slurmd_release_resources: affinity jobid 62379
Jan 30 17:12:57 r113c18s01 slurmd[26867]: debug: task_p_slurmd_release_resources: affinity jobid 62379
Jan 30 17:16:55 r113c18s01 slurmd[26867]: debug: task_p_slurmd_release_resources: affinity jobid 62379
Jan 30 17:21:02 r113c18s01 slurmd[26867]: debug: task_p_slurmd_release_resources: affinity jobid 62379
Jan 30 17:25:10 r113c18s01 slurmd[26867]: debug: task_p_slurmd_release_resources: affinity jobid 62379
Jan 30 17:29:11 r113c18s01 slurmd[26867]: debug: task_p_slurmd_release_resources: affinity jobid 62379
Jan 30 17:33:21 r113c18s01 slurmd[26867]: debug: task_p_slurmd_release_resources: affinity jobid 62379
Jan 30 17:37:24 r113c18s01 slurmd[26867]: debug: task_p_slurmd_release_resources: affinity jobid 62379
Jan 30 17:41:26 r113c18s01 slurmd[26867]: debug: task_p_slurmd_release_resources: affinity jobid 62379
Jan 30 17:45:28 r113c18s01 slurmd[26867]: debug: task_p_slurmd_release_resources: affinity jobid 62379
Jan 30 17:49:35 r113c18s01 slurmd[26867]: debug: task_p_slurmd_release_resources: affinity jobid 62379
Jan 30 17:53:37 r113c18s01 slurmd[26867]: debug: task_p_slurmd_release_resources: affinity jobid 62379
We tried to set up an UnkillableStepProgram to kill the sleep process, but the script is never invoked. We guess this is because
the slurmd child process is defunct.
Any ideas?
Thanks
ale
--
Alessandro Federico
HPC System Management Group
System & Technology Department
CINECA www.cineca.it
Via dei Tizii 6, 00185 Rome - Italy
phone: +39 06 44486708