[slurm-users] Jobs escaping cgroup device controls after some amount of time.

Shawn Bobbin sabobbin at umiacs.umd.edu
Mon Apr 23 12:40:58 MDT 2018


Hi,

I attached our cgroup.conf and gres.conf.  

As for the cgroup_allowed_devices.conf file, I have it stubbed out but empty.  In 17.02, slurm started fine without this file (as far as I remember), and leaving it empty doesn’t appear to have any actual impact: device availability remains the same.  Based on the behavior explained in [0], I don’t expect this file to affect the containment of specific GPUs.
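For context, when that allowed-devices file is actually populated it normally just whitelists the generic system devices, along the lines of the stock example below (ours is an empty stub, so treat this as illustrative only); the per-GPU allow/deny decisions come from the device entries in gres.conf instead:

/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*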

TaskPlugin = task/cgroup
ProctrackType = proctrack/cgroup
JobAcctGatherType = jobacct_gather/cgroup
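For anyone reading the archive without the attachments, the device side of a cgroup.conf like this generally comes down to the following (a sketch assuming a default /etc/slurm layout, not a verbatim copy of our file):

# AllowedDevicesFile path is illustrative; see the attached file for our actual values
CgroupAutomount=yes
ConstrainDevices=yes
AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices.conf
# Core/RAM containment are separate knobs (ConstrainCores, ConstrainRAMSpace)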



[0] https://bugs.schedmd.com/show_bug.cgi?id=4122






> On Apr 13, 2018, at 9:25 AM, Kevin Manalo <kmanalo at jhu.edu> wrote:
> 
> I’m asking in the hope that others will chime in (I’m curious why this is happening).
>  
> Could you share your related slurm.conf and cgroup options?
>  
> cgroup.conf
> cgroup_allowed_devices_file.conf
>  
> TaskPlugin
> ProctrackType
> JobAcctGatherType
>  
> -Kevin
>  
> PS: Looking for similar-style jobs for comparison. We have GPU users running >1 day inside cgroups, but not multi-tenant currently. 17.11.5, CentOS 6.
>  
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Shawn Bobbin <sabobbin at umiacs.umd.edu>
> Reply-To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Date: Thursday, April 12, 2018 at 9:25 AM
> To: "slurm-users at schedmd.com" <slurm-users at schedmd.com>
> Subject: [slurm-users] Jobs escaping cgroup device controls after some amount of time.
>  
> Hi,
>  
> We’re running slurm 17.11.5 on RHEL 7 and have been having issues with jobs escaping their cgroup controls on GPU devices.
>  
>  
> For example we have the following steps running:
>  
> # ps auxn | grep [s]lurmstepd
>        0  2380  0.0  0.0 538436  3700 ?        Sl   07:22   0:02 slurmstepd: [46609.0]
>        0  5714  0.0  0.0 472136  3952 ?        Sl   Apr11   0:03 slurmstepd: [46603.0]
>        0 17202  0.0  0.0 538448  3724 ?        Sl   Apr11   0:03 slurmstepd: [46596.0]
>        0 28673  0.0  0.0 538380  3696 ?        Sl   Apr10   0:39 slurmstepd: [46262.0]
>        0 44832  0.0  0.0 538640  3964 ?        Sl   Apr11   1:12 slurmstepd: [46361.0]
>  
>  
> But not all of those are reflected in the cgroup device hierarchy:
>  
> # lscgroup | grep devices | grep slurm
> devices:/slurm
> devices:/slurm/uid_2093
> devices:/slurm/uid_2093/job_46609
> devices:/slurm/uid_2093/job_46609/step_0
> devices:/slurm/uid_11477
> devices:/slurm/uid_11477/job_46603
> devices:/slurm/uid_11477/job_46603/step_0
> devices:/slurm/uid_11184
> devices:/slurm/uid_11184/job_46596
> devices:/slurm/uid_11184/job_46596/step_0
>  
>  
> This issue only seems to happen after a job has been running for a while; when a job first starts, the cgroup controls work as expected.  In this example, the jobs that have escaped the controls (46262 and 46361) have been running for over a day:
>  
> # squeue -j 46609,46603,46596,46262,46361
>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>              46596     dpart     bash     yhng  R   10:56:00      1 vulcan14
>              46609 scavenger     bash    yaser  R    1:52:37      1 vulcan14
>              46603 scavenger     bash  jxzheng  R    9:47:26      1 vulcan14
>              46361     dpart     bash  jxzheng  R 1-08:31:14      1 vulcan14
>              46262     dpart Weighted  umahbub  R 1-18:07:07      1 vulcan14
>  
>  
> So it seems that at some point slurm, or something else, comes in and modifies the cgroup hierarchy, but we haven’t had much luck in tracking down what.
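> (A quick way to spot-check a single step is to compare the devices line in /proc/<pid>/cgroup for its slurmstepd against the expected slurm/uid_*/job_*/step_* path; e.g. for step 46361.0, whose slurmstepd PID is 44832 in the ps output above:
>  
> # should still show a path under devices:/slurm/... while the step is contained
> grep devices /proc/44832/cgroup
>  
> That at least confirms which devices hierarchy a step's processes actually ended up in.)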
>  
> Has anyone run into this, or have any pointers for troubleshooting this further?
>  
> Thanks,
> —Shawn

-------------- next part --------------
Attachments:
slurm.conf: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20180423/2ec98b85/attachment-0002.obj>
gres.conf: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20180423/2ec98b85/attachment-0003.obj>

