[slurm-users] gres:gpu management
Daniel Vecerka
vecerka at fel.cvut.cz
Thu May 23 09:21:34 UTC 2019
I have run deviceQuery under sbatch again, and it works now:
Device PCI Domain ID / Bus ID / location ID: 0 / 97 / 0
Device PCI Domain ID / Bus ID / location ID: 0 / 137 / 0
Device PCI Domain ID / Bus ID / location ID: 0 / 98 / 0
Device PCI Domain ID / Bus ID / location ID: 0 / 138 / 0
and Aaron is right that the cgroup refers to the first allocated GPU as 0:
CUDA_VISIBLE_DEVICES is still set to 0 in every job.
So IMHO the documentation at https://slurm.schedmd.com/gres.html is a little
bit confusing.
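
Since CUDA_VISIBLE_DEVICES is 0 in every job, the PCI Bus ID is the thing to compare. As a rough sketch (the function names here are made up for illustration, and it assumes deviceQuery prints the Bus ID in decimal in exactly the line format shown above), the collected output of several jobs could be checked like this:

```python
import re

# Matches lines such as:
#   Device PCI Domain ID / Bus ID / location ID: 0 / 97 / 0
# (line format taken from the deviceQuery output in this thread)
PCI_LINE = re.compile(
    r"Device PCI Domain ID / Bus ID / location ID:\s*(\d+)\s*/\s*(\d+)\s*/\s*(\d+)"
)


def bus_ids(device_query_output):
    """Return the Bus ID (as int) from every matching line, in order."""
    return [int(m.group(2)) for m in PCI_LINE.finditer(device_query_output)]


def jobs_share_a_gpu(device_query_output):
    """True if any Bus ID shows up more than once in the collected output."""
    ids = bus_ids(device_query_output)
    return len(set(ids)) < len(ids)


def as_hex(bus_id):
    """Two-digit hex form of a Bus ID, the notation lspci uses (97 -> '61')."""
    return format(bus_id, "02X")
```

Fed the four lines above, `jobs_share_a_gpu` returns False; fed the four lines from the earlier run quoted below (all Bus ID 97), it returns True.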
I really don't know where the problem was. When I tried yesterday, I think
it didn't work, or I've just lost my mind due to frustration.
Anyway, the problem is solved.
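
For anyone who hits the same symptom, a quick way to confirm the ConstrainDevices setting Aaron asked about might look like the sketch below. The parser is a hypothetical helper, not part of Slurm. It assumes plain Key=Value lines with optional # comments, compares keys case-insensitively, and treats a missing ConstrainDevices as "no".

```python
def parse_conf(text):
    """Parse simple Key=Value lines, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue  # e.g. a bare path or an empty line
        key, value = line.split("=", 1)
        # Assumption: keys are matched case-insensitively.
        settings[key.strip().lower()] = value.strip().strip('"')
    return settings


def devices_constrained(text):
    """True if ConstrainDevices is set to yes (assumed 'no' when absent)."""
    return parse_conf(text).get("constraindevices", "no").lower() == "yes"


if __name__ == "__main__":
    # Path taken from the cgroup.conf quoted in this thread.
    with open("/etc/slurm/cgroup.conf") as fh:
        print("ConstrainDevices enabled:", devices_constrained(fh.read()))
```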
Thanks, Daniel
On 23.05.2019 10:11, Daniel Vecerka wrote:
> Jobs end up on the same GPU. If I run CUDA deviceQuery in the sbatch I get:
>
> Device PCI Domain ID / Bus ID / location ID: 0 / 97 / 0
> Device PCI Domain ID / Bus ID / location ID: 0 / 97 / 0
> Device PCI Domain ID / Bus ID / location ID: 0 / 97 / 0
> Device PCI Domain ID / Bus ID / location ID: 0 / 97 / 0
>
> Our cgroup.conf :
>
> /etc/slurm/cgroup.conf
> CgroupAutomount=yes
> CgroupReleaseAgentDir="/etc/slurm/cgroup"
> ConstrainCores=yes
> ConstrainDevices=yes
> ConstrainRAMSpace=yes
>
>
> Daniel
>
> On 23.05.2019 9:54, Aaron Jackson wrote:
>> Do jobs actually end up on the same GPU though? cgroups will always
>> refer to the first allocated GPU as 0, so it is not unexpected for each
>> job to have CUDA_VISIBLE_DEVICES set to 0. Make sure you have the following
>> in /etc/cgroup.conf
>>
>> ConstrainDevices=yes
>>
>> Aaron
>>
>>
>
>