[slurm-users] GPU / cgroup challenges

Wed May 2 09:11:46 MDT 2018

This came up around 12/17, I think, and as I recall the fixes were added to
the src repo then; however, they weren't added to any of the 17.releases.
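
For anyone wanting to confirm the symptom Paul describes below (both GPUs still usable once CUDA_VISIBLE_DEVICES is unset), here is a minimal sketch that can be run inside a job step. It assumes the GPUs appear as /dev/nvidiaN and that a working devices cgroup rejects the open with EPERM; the function names are made up for illustration and are not part of Slurm.

```python
import errno
import glob
import os


def probe_device(path):
    """Try to open one device node and classify the result."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        # Assumption: a devices cgroup denial shows up as EPERM;
        # EACCES would indicate ordinary file permissions instead.
        if exc.errno in (errno.EPERM, errno.EACCES):
            return "denied"
        return "error: " + os.strerror(exc.errno)
    os.close(fd)
    return "accessible"


def probe_gpus(pattern="/dev/nvidia[0-9]*"):
    """Probe every device node matching the (assumed) NVIDIA naming pattern."""
    return {path: probe_device(path) for path in sorted(glob.glob(pattern))}


if __name__ == "__main__":
    # CUDA_VISIBLE_DEVICES only hides devices from CUDA; it does not block open().
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    for path, status in probe_gpus().items():
        print(path, status)
```

If the cgroup constraint is really being enforced, a job started with --gres=gpu:1 should report one device as accessible and the other as denied, whatever CUDA_VISIBLE_DEVICES says.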

On Wed, May 2, 2018 at 6:04 AM, R. Paul Wiegand <rpwiegand at gmail.com> wrote:

> I dug into the logs on both the slurmctld side and the slurmd side.
> For the record, I have debug2 set for both and
> DebugFlags=CPU_BIND,Gres.
>
> I cannot see much that is terribly relevant in the logs.  There's a
> known parameter error reported with the memory cgroup specifications,
> but I don't think that is germane.
>
> When I set "--gres=gpu:1", the slurmd log does have encouraging lines such
> as:
>
> [2018-05-02T08:47:04.916] [203.0] debug:  Allowing access to device
> /dev/nvidia0 for job
> [2018-05-02T08:47:04.916] [203.0] debug:  Not allowing access to
> device /dev/nvidia1 for job
>
> However, I can still "see" both devices from nvidia-smi, and I can
> still access both if I manually unset CUDA_VISIBLE_DEVICES.
>
> When I do *not* specify --gres at all, there is no reference to gres,
> gpu, nvidia, or anything similar in any log at all.  And, of course, I
> have full access to both GPUs.
>
> I am happy to attach the snippets of the relevant logs, if someone
> more knowledgeable wants to pore through them.  I can also set the
> debug level higher, if you think that would help.
>
>
> Assuming upgrading will solve our problem, in the meantime:  Is there
> a way to ensure that the *default* request always has "--gres=gpu:1"?
> That is, this situation is doubly bad for us not just because there is
> *a way* around the resource management of the device but also because
> the *DEFAULT* behavior if a user issues an srun/sbatch without
> specifying a Gres is to go around the resource manager.
>
>
>
> On Tue, May 1, 2018 at 8:29 PM, Christopher Samuel <chris at csamuel.org>
> wrote:
> > On 02/05/18 10:15, R. Paul Wiegand wrote:
> >
> >> Yes, I am sure they are all the same.  Typically, I just scontrol
> >> reconfig; however, I have also tried restarting all daemons.
> >
> >
> > Understood. Any diagnostics in the slurmd logs when trying to start
> > a GPU job on the node?
> >
> >> We are moving to 7.4 in a few weeks during our downtime.  We had a
> >> QDR -> OFED version constraint -> Lustre client version constraint
> >> issue that delayed our upgrade.
> >
> >
> > I feel your pain..  BTW RHEL 7.5 is out now so you'll need that if
> > you need current security fixes.
> >
> >> Should I just wait and test after the upgrade?
> >
> >
> > Well, 17.11.6 will be out by then, and it will include a fix for a
> > deadlock that some sites hit occasionally, so that will be worth
> > throwing into the mix too.  Do read the RELEASE_NOTES carefully
> > though, especially if you're using slurmdbd!
> >
> >
> > All the best,
> > Chris
> > --
> >  Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
> >
>
>