[slurm-users] Slurmstepd sleep processes
Jeffrey T Frey
frey at udel.edu
Fri Aug 3 21:29:38 MDT 2018
See:
https://github.com/SchedMD/slurm/blob/master/src/slurmd/slurmstepd/mgr.c
Circa line 1072 the comment explains:
    /*
     * Need to exec() something for proctrack/linuxproc to
     * work, it will not keep a process named "slurmstepd"
     */
    execl(SLEEP_CMD, "sleep", "100000000", NULL);
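Basically, proctrack/linuxproc will produce an error if a slurmstepd is running zero subprocesses, and it will not track a process that is itself named "slurmstepd". So a very long sleep command (100000000 seconds, a bit over three years) is spawned to satisfy that condition, no matter which proctrack plugin is actually being used.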
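For anyone who wants to see the pattern in isolation, here is a minimal sketch of "fork a child, exec a long sleep, kill it when done". This is not the Slurm code: the helper names spawn_placeholder() and reap_placeholder() are made up for illustration, and I use execlp() to find "sleep" on the PATH, whereas mgr.c calls execl() with the SLEEP_CMD macro.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Slurm): fork a child that execs a
 * very long sleep. Returns the child's PID, or -1 if fork() failed.
 */
pid_t spawn_placeholder(void)
{
    pid_t pid = fork();

    if (pid == 0) {
        /*
         * Child: replace the process image, so the child no longer
         * carries the parent's name. Assumption: "sleep" is on PATH.
         */
        execlp("sleep", "sleep", "100000000", (char *)NULL);
        _exit(127); /* only reached if the exec failed */
    }
    return pid;
}

/*
 * Hypothetical helper: terminate the placeholder and reap it.
 * Returns the signal that ended the child, or -1 on any failure.
 */
int reap_placeholder(pid_t pid)
{
    int status;

    if (kill(pid, SIGTERM) != 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Calling spawn_placeholder() from a parent process and then running ps should show a "sleep 100000000" child underneath it, much like the extern step in the listing quoted below. reap_placeholder() cleans it up again.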
> On Aug 3, 2018, at 17:42, Christopher Benjamin Coffey <Chris.Coffey at nau.edu> wrote:
>
> Hello,
>
> Has anyone observed "sleep 100000000" processes on their compute nodes? They seem to be tied to the slurmstepd extern process in slurm:
>
> 4 S root 136777 1 0 80 0 - 73218 do_wai 05:48 ? 00:00:01 slurmstepd: [13220317.extern]
> 0 S root 136782 136777 0 80 0 - 25229 hrtime 05:48 ? 00:00:00 \_ sleep 100000000
> 4 S root 136784 1 0 80 0 - 73280 do_wai 05:48 ? 00:00:02 slurmstepd: [13220317.batch]
> 4 S tes87 136789 136784 0 80 0 - 26520 do_wai 05:48 ? 00:00:00 \_ /bin/bash /var/spool/slurm/slurmd/job13220317/slurm_script
> 4 S root 136807 1 0 80 0 - 107157 do_wai 05:48 ? 00:00:01 slurmstepd: [13220317.1]
>
> I'm not exactly sure what the extern piece is for. Anyone know what this is all about? Is this normal? We just saw this the other day while investigating some issues. Sleeping for 3.17 years seems strange. Any help would be appreciated, thanks!
>
> Best,
> Chris
>
> —
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>
>