[slurm-users] GPU / cgroup challenges
wiegand at ist.ucf.edu
Wed May 2 09:15:37 MDT 2018
So there is a patch?
------ Original message------
From: Fulcomer, Samuel
Date: Wed, May 2, 2018 11:14
To: Slurm User Community List;
Subject: Re: [slurm-users] GPU / cgroup challenges
This came up around 12/17, I think, and as I recall the fixes were added to the src repo then; however, they weren't added to any of the 17.x releases.
On Wed, May 2, 2018 at 6:04 AM, R. Paul Wiegand <rpwiegand at gmail.com> wrote:
I dug into the logs on both the slurmctld side and the slurmd side.
For the record, I have debug2 set for both and
I cannot see much that is terribly relevant in the logs. There's a
known parameter error reported with the memory cgroup specifications,
but I don't think that is germane.
When I set "--gres=gpu:1", the slurmd log does have encouraging lines such as:
[2018-05-02T08:47:04.916] [203.0] debug: Allowing access to device /dev/nvidia0 for job
[2018-05-02T08:47:04.916] [203.0] debug: Not allowing access to device /dev/nvidia1 for job
However, I can still "see" both devices from nvidia-smi, and I can
still access both if I manually unset CUDA_VISIBLE_DEVICES.
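To check those allow/deny lines across many job steps rather than by eye, a small script can summarize what slurmd reports. This is only a minimal sketch, assuming the log entries look like the two debug lines quoted above; the function name and the regular expression are illustrative and not part of Slurm. It reports what slurmd logged, not what the cgroup actually enforces.

```python
import re

# Matches slurmd debug lines of the form quoted above, e.g.
# "[203.0] debug: Allowing access to device /dev/nvidia0 for job".
# The pattern is an assumption based on those two lines only.
DEVICE_LINE = re.compile(
    r"\[(?P<step>[^\]\s]+)\]\s+debug:\s+"
    r"(?P<verdict>Not allowing|Allowing) access to\s+device\s+"
    r"(?P<device>\S+)\s+for job"
)


def summarize_device_access(log_text):
    """Return {step_id: {"allowed": [...], "denied": [...]}} from slurmd log text."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    summary = {}
    for match in DEVICE_LINE.finditer(flat):
        entry = summary.setdefault(match.group("step"), {"allowed": [], "denied": []})
        key = "denied" if match.group("verdict").startswith("Not") else "allowed"
        entry[key].append(match.group("device"))
    return summary
```

Feeding it the two lines above should list /dev/nvidia0 as allowed and /dev/nvidia1 as denied for step 203.0, which is the mismatch with what nvidia-smi shows.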
When I do *not* specify --gres at all, there is no reference to gres,
gpu, nvidia, or anything similar in any log at all. And, of course, I
have full access to both GPUs.
I am happy to attach the snippets of the relevant logs, if someone
more knowledgeable wants to pore over them. I can also set the
debug level higher, if you think that would help.
Assuming upgrading will solve our problem, in the meantime: Is there
a way to ensure that the *default* request always has "--gres=gpu:1"?
That is, this situation is doubly bad for us not just because there is
*a way* around the resource management of the device but also because
the *DEFAULT* behavior if a user issues an srun/sbatch without
specifying a Gres is to go around the resource manager.
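One possible stopgap for the default-request question is a thin wrapper that adds the option when the user leaves it out. The following is a sketch only: the helper name `with_default_gres` and the wrapper idea are hypothetical, not a Slurm feature, and a user who calls the real srun/sbatch directly would bypass it. A site-side job_submit plugin is probably the more robust place for such a default.

```python
def with_default_gres(args, default="gpu:1"):
    """Return a copy of args with --gres=<default> prepended unless --gres is given.

    Hypothetical helper for a wrapper script around srun/sbatch.
    Limitations: it does not recognize abbreviated long options, and it
    scans every argument, so a "--gres" meant for the user's own program
    (after the executable name) is also treated as a Slurm request.
    """
    for arg in args:
        if arg == "--":
            break  # stop scanning at an explicit end-of-options marker
        if arg == "--gres" or arg.startswith("--gres="):
            return list(args)  # the user asked for something explicit
    return ["--gres=" + default] + list(args)
```

A wrapper would then run the real command with the adjusted list, for example `subprocess.run(["sbatch"] + with_default_gres(sys.argv[1:]))`.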
On Tue, May 1, 2018 at 8:29 PM, Christopher Samuel <chris at csamuel.org> wrote:
> On 02/05/18 10:15, R. Paul Wiegand wrote:
>> Yes, I am sure they are all the same. Typically, I just scontrol
>> reconfig; however, I have also tried restarting all daemons.
> Understood. Any diagnostics in the slurmd logs when trying to start
> a GPU job on the node?
>> We are moving to 7.4 in a few weeks during our downtime. We had a
>> QDR -> OFED version constraint -> Lustre client version constraint
>> issue that delayed our upgrade.
> I feel your pain.. BTW RHEL 7.5 is out now so you'll need that if
> you need current security fixes.
>> Should I just wait and test after the upgrade?
> Well, 17.11.6 will be out by then, and that will include a fix for a deadlock
> that some sites hit occasionally, so that will be worth throwing
> into the mix too. Do read the RELEASE_NOTES carefully though,
> especially if you're using slurmdbd!
> All the best,
> Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC