[slurm-users] Jobs escaping cgroup device controls after some amount of time.
Shawn Bobbin
sabobbin at umiacs.umd.edu
Thu Apr 12 07:23:07 MDT 2018
Hi,
We’re running Slurm 17.11.5 on RHEL 7 and have been having issues with jobs escaping their cgroup controls on GPU devices.
For example, we have the following steps running:
# ps auxn | grep [s]lurmstepd
0 2380 0.0 0.0 538436 3700 ? Sl 07:22 0:02 slurmstepd: [46609.0]
0 5714 0.0 0.0 472136 3952 ? Sl Apr11 0:03 slurmstepd: [46603.0]
0 17202 0.0 0.0 538448 3724 ? Sl Apr11 0:03 slurmstepd: [46596.0]
0 28673 0.0 0.0 538380 3696 ? Sl Apr10 0:39 slurmstepd: [46262.0]
0 44832 0.0 0.0 538640 3964 ? Sl Apr11 1:12 slurmstepd: [46361.0]
But not all of those are reflected in the cgroup device hierarchy:
# lscgroup | grep devices | grep slurm
devices:/slurm
devices:/slurm/uid_2093
devices:/slurm/uid_2093/job_46609
devices:/slurm/uid_2093/job_46609/step_0
devices:/slurm/uid_11477
devices:/slurm/uid_11477/job_46603
devices:/slurm/uid_11477/job_46603/step_0
devices:/slurm/uid_11184
devices:/slurm/uid_11184/job_46596
devices:/slurm/uid_11184/job_46596/step_0
This issue only seems to happen after a job has been running for a while; when a job first starts, the cgroup controls work as expected. In this example, the jobs that have escaped the controls (46361, 46262) have been running for over a day:
# squeue -j 46609,46603,46596,46262,46361
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
46596 dpart bash yhng R 10:56:00 1 vulcan14
46609 scavenger bash yaser R 1:52:37 1 vulcan14
46603 scavenger bash jxzheng R 9:47:26 1 vulcan14
46361 dpart bash jxzheng R 1-08:31:14 1 vulcan14
46262 dpart Weighted umahbub R 1-18:07:07 1 vulcan14
So it seems that at some point Slurm, or something else, comes in and modifies the cgroup hierarchy, but we haven’t had much luck in tracking down what that is.
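One way we’ve been comparing the two views is to ask /proc directly which devices cgroup each slurmstepd is actually in, rather than trusting lscgroup. A rough sketch (the pgrep pattern and the cgroup-v1 layout of /proc/PID/cgroup are assumptions for our setup):

```shell
# Sketch: print the devices-controller cgroup each slurmstepd currently
# belongs to, per /proc/<pid>/cgroup (assumes cgroup v1 with a devices
# controller mounted; prints "-" if no devices line is found).
devices_cgroup() {
  # $1 = a PID; field 2 of /proc/PID/cgroup is a comma-separated
  # controller list, field 3 is the cgroup path for those controllers.
  awk -F: '$2 ~ /(^|,)devices(,|$)/ { print $3; found=1 }
           END { if (!found) print "-" }' "/proc/$1/cgroup" 2>/dev/null
}

# List each stepd PID alongside the devices cgroup it is really in;
# escaped steps should show up outside /slurm/uid_*/job_*/step_*.
for pid in $(pgrep -f 'slurmstepd: \['); do
  printf '%s %s\n' "$pid" "$(devices_cgroup "$pid")"
done
```

Any stepd that prints a path outside the /slurm hierarchy (or "-") has presumably been moved or had its cgroup torn down underneath it.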
Has anyone run into this, or have any pointers for troubleshooting this further?
Thanks,
—Shawn