[slurm-users] Jobs escaping cgroup device controls after some amount of time.
Shawn Bobbin
sabobbin at umiacs.umd.edu
Mon Apr 23 12:40:58 MDT 2018
Hi,
I attached our cgroup.conf and gres.conf.
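Since attachments get scrubbed in the archive, here is roughly the shape of those two files. Treat this as an illustrative sketch (the device paths and the AllowedDevicesFile location are examples), not a verbatim copy of ours:

# cgroup.conf (sketch; ConstrainDevices=yes is what enables device containment)
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"

# gres.conf (sketch; one line per GPU device file)
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1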
As for the cgroup_allowed_devices.conf file, I have it stubbed out but empty. In 17.02, Slurm started fine without this file (as far as I remember), and its being empty doesn’t appear to impact anything: device availability remains the same. Based on the behavior explained in [0], I don’t expect this file to affect per-GPU containment.
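(For reference, when that file is populated, the example in the Slurm cgroup documentation looks like the following; ours, again, is simply empty.)

/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*

And the slurm.conf plugin settings you asked about: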
TaskPlugin = task/cgroup
ProctrackType = proctrack/cgroup
JobAcctGatherType = jobacct_gather/cgroup
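In case it helps catch the moment containment is lost: a low-tech watcher could snapshot the devices hierarchy once a minute and print a diff, plus the slurmstepd processes still running, whenever the set changes. This is an illustrative sketch (it assumes cgroup v1 and the libcgroup lscgroup tool used below), not something we have actually deployed:

#!/bin/bash
# Snapshot slurm's devices cgroups once a minute; when the set changes,
# print the diff, a timestamp, and the slurmstepd processes still running.
prev=$(mktemp)
lscgroup | grep '^devices:/slurm' | sort > "$prev"
while sleep 60; do
    cur=$(mktemp)
    lscgroup | grep '^devices:/slurm' | sort > "$cur"
    if ! diff -u "$prev" "$cur"; then
        date
        ps auxn | grep '[s]lurmstepd'
    fi
    mv "$cur" "$prev"
done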
[0] https://bugs.schedmd.com/show_bug.cgi?id=4122
> On Apr 13, 2018, at 9:25 AM, Kevin Manalo <kmanalo at jhu.edu> wrote:
>
> I’m asking in the hope that others will chime in (I’m curious why this is happening).
>
> Could you share your related slurm.conf cgroup options?
>
> cgroup.conf
> cgroup_allowed_devices_file.conf
>
> TaskPlugin
> ProctrackType
> JobAcctGatherType
>
> -Kevin
>
> PS: Looking for similar-style jobs: we have GPU users running >1 day inside cgroups, but not multi-tenant currently. 17.11.5, CentOS 6.
>
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Shawn Bobbin <sabobbin at umiacs.umd.edu>
> Reply-To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Date: Thursday, April 12, 2018 at 9:25 AM
> To: "slurm-users at schedmd.com" <slurm-users at schedmd.com>
> Subject: [slurm-users] Jobs escaping cgroup device controls after some amount of time.
>
> Hi,
>
> We’re running Slurm 17.11.5 on RHEL 7 and have been having issues with jobs escaping their cgroup controls on GPU devices.
>
>
> For example we have the following steps running:
>
> # ps auxn | grep [s]lurmstepd
> 0 2380 0.0 0.0 538436 3700 ? Sl 07:22 0:02 slurmstepd: [46609.0]
> 0 5714 0.0 0.0 472136 3952 ? Sl Apr11 0:03 slurmstepd: [46603.0]
> 0 17202 0.0 0.0 538448 3724 ? Sl Apr11 0:03 slurmstepd: [46596.0]
> 0 28673 0.0 0.0 538380 3696 ? Sl Apr10 0:39 slurmstepd: [46262.0]
> 0 44832 0.0 0.0 538640 3964 ? Sl Apr11 1:12 slurmstepd: [46361.0]
>
>
> But not all of those are reflected in the cgroup device hierarchy:
>
> # lscgroup | grep devices | grep slurm
> devices:/slurm
> devices:/slurm/uid_2093
> devices:/slurm/uid_2093/job_46609
> devices:/slurm/uid_2093/job_46609/step_0
> devices:/slurm/uid_11477
> devices:/slurm/uid_11477/job_46603
> devices:/slurm/uid_11477/job_46603/step_0
> devices:/slurm/uid_11184
> devices:/slurm/uid_11184/job_46596
> devices:/slurm/uid_11184/job_46596/step_0
>
>
> This issue only seems to happen after a job has been running for a while; when a job first starts, the cgroup controls work as expected. In this example, the jobs that have escaped the controls (46361 and 46262) have been running for over a day:
>
> # squeue -j 46609,46603,46596,46262,46361
> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
> 46596 dpart bash yhng R 10:56:00 1 vulcan14
> 46609 scavenger bash yaser R 1:52:37 1 vulcan14
> 46603 scavenger bash jxzheng R 9:47:26 1 vulcan14
> 46361 dpart bash jxzheng R 1-08:31:14 1 vulcan14
> 46262 dpart Weighted umahbub R 1-18:07:07 1 vulcan14
>
>
> So it seems that at some point slurm, or something else, comes in and modifies the cgroup hierarchy, but we haven’t had much luck in tracking down what.
>
> Has anyone run into this, or have any pointers for troubleshooting this further?
>
> Thanks,
> —Shawn
-------------- next part --------------
A non-text attachment was scrubbed...
Name: slurm.conf
Type: application/octet-stream
Size: 5496 bytes
Desc: not available
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20180423/2ec98b85/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gres.conf
Type: application/octet-stream
Size: 303 bytes
Desc: not available
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20180423/2ec98b85/attachment-0003.obj>