<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
</head>
<body>
<div style="font-size:10pt">
<div dir="auto">So there is a patch?</div>
</div>
<div style="font-size:10pt">
<div dir="auto"><br>
</div>
<div dir="auto">------ Original message------</div>
<div dir="auto"><b>From: </b>Fulcomer, Samuel</div>
<div dir="auto"><b>Date: </b>Wed, May 2, 2018 11:14</div>
<div dir="auto"><b>To: </b>Slurm User Community List;</div>
<div dir="auto"><b>Cc: </b></div>
<div dir="auto"><b>Subject: </b>Re: [slurm-users] GPU / cgroup challenges</div>
<div dir="auto"><br>
</div>
</div>
<div>
<div dir="ltr">This came up around 12/17, I think, and as I recall the fixes were added to the src repo then; however, they weren't added to any of the 17.x releases. </div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Wed, May 2, 2018 at 6:04 AM, R. Paul Wiegand <span dir="ltr">
&lt;<a href="mailto:rpwiegand@gmail.com" target="_blank">rpwiegand@gmail.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex; border-left:1px #ccc solid; padding-left:1ex">
I dug into the logs on both the slurmctld side and the slurmd side.<br>
For the record, I have debug2 set for both and<br>
DebugFlags=CPU_BIND,Gres.<br>
<br>
I cannot see much that is terribly relevant in the logs. There's a<br>
known parameter error reported with the memory cgroup specifications,<br>
but I don't think that is germane.<br>
<br>
When I set "--gres=gpu:1", the slurmd log does have encouraging lines such as:<br>
<br>
[2018-05-02T08:47:04.916] [203.0] debug: Allowing access to device<br>
/dev/nvidia0 for job<br>
[2018-05-02T08:47:04.916] [203.0] debug: Not allowing access to<br>
device /dev/nvidia1 for job<br>
<br>
However, I can still "see" both devices from nvidia-smi, and I can<br>
still access both if I manually unset CUDA_VISIBLE_DEVICES.<br>
<br>
When I do *not* specify --gres at all, there is no reference to gres,<br>
gpu, nvidia, or anything similar in any log at all. And, of course, I<br>
have full access to both GPUs.<br>
<br>
I am happy to attach the snippets of the relevant logs, if someone<br>
more knowledgeable wants to pore over them. I can also set the<br>
debug level higher, if you think that would help.<br>
<br>
<br>
Assuming upgrading will solve our problem, in the meantime: Is there<br>
a way to ensure that the *default* request always has "--gres=gpu:1"?<br>
That is, this situation is doubly bad for us not just because there is<br>
*a way* around the resource management of the device but also because<br>
the *DEFAULT* behavior if a user issues an srun/sbatch without<br>
specifying a Gres is to go around the resource manager.<br>
<br>
<br>
<br>
On Tue, May 1, 2018 at 8:29 PM, Christopher Samuel &lt;<a href="mailto:chris@csamuel.org">chris@csamuel.org</a>&gt; wrote:<br>
> On 02/05/18 10:15, R. Paul Wiegand wrote:<br>
><br>
>> Yes, I am sure they are all the same. Typically, I just scontrol<br>
>> reconfig; however, I have also tried restarting all daemons.<br>
><br>
><br>
> Understood. Any diagnostics in the slurmd logs when trying to start<br>
> a GPU job on the node?<br>
><br>
>> We are moving to 7.4 in a few weeks during our downtime. We had a<br>
>> QDR -> OFED version constraint -> Lustre client version constraint<br>
>> issue that delayed our upgrade.<br>
><br>
><br>
> I feel your pain.. BTW RHEL 7.5 is out now so you'll need that if<br>
> you need current security fixes.<br>
><br>
>> Should I just wait and test after the upgrade?<br>
><br>
><br>
> Well, 17.11.6 will be out by then, and it will include a fix for a deadlock<br>
> that some sites hit occasionally, so that will be worth throwing<br>
> into the mix too. Do read the RELEASE_NOTES carefully though,<br>
> especially if you're using slurmdbd!<br>
><br>
><br>
> All the best,<br>
> Chris<br>
<span class="HOEnZb"><font color="#888888">> --<br>
> Chris Samuel : <a href="http://www.csamuel.org/" rel="noreferrer" target="_blank">
http://www.csamuel.org/</a> : Melbourne, VIC<br>
><br>
<br>
</font></span></blockquote>
</div>
<br>
</div>
</div>
</body>
</html>