[slurm-users] GRES with Docker problem
허웅
hoewoonggood at naver.com
Tue Jan 1 18:25:41 MST 2019
Hi,

I'm using Slurm with GRES (4 GPUs), and I want jobs to be spread evenly across the GPUs. That part works, but the GPU binding is not respected when I use Docker inside an allocation.

For example, if I run the command below four times in different ttys, I get exactly what I want: as you can see, every allocation ends up on a different GPU (all the Bus-Ids differ; the quick check after output #4 shows the same thing from inside the allocation).
#1
$ srun --gres=gpu:1 --gres-flags=enforce-binding --cpus-per-task=8 --mem=20G --pty bash
$ nvidia-smi
Wed Jan 2 01:02:00 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:14:00.0 Off | 0 |
| N/A 30C P0 27W / 250W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
#2
$ srun --gres=gpu:1 --gres-flags=enforce-binding --cpus-per-task=8 --mem=20G --pty bash
$ nvidia-smi
Wed Jan 2 01:02:39 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:15:00.0 Off | 0 |
| N/A 32C P0 26W / 250W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
#3
$ srun --gres=gpu:1 --gres-flags=enforce-binding --cpus-per-task=8 --mem=20G --pty bash
$ nvidia-smi
Wed Jan 2 00:36:22 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:39:00.0 Off | 0 |
| N/A 30C P0 27W / 250W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
#4
$ srun --gres=gpu:1 --gres-flags=enforce-binding --cpus-per-task=8 --mem=20G --pty bash
$ nvidia-smi
Wed Jan 2 01:03:50 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:3A:00.0 Off | 0 |
| N/A 29C P0 27W / 250W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
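If it helps, the binding can also be checked from inside each allocation; a quick sketch (the variable and the cgroup path are illustrative, the uid/job/step components depend on the node and job):

# the gres/gpu plugin normally exports the index of the allocated GPU
$ echo $CUDA_VISIBLE_DEVICES
# the device cgroup set up by task/cgroup lists which /dev/nvidia* nodes the step may open
$ cat /sys/fs/cgroup/devices/slurm/uid_0/job_472/step_0/devices.list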
scontrol show job also looks right: each job is bound to a different GPU index (GRES_IDX).
$ scontrol show job=472 --details
JobId=472 JobName=bash
UserId=root(0) GroupId=root(0) MCS_label=N/A
Priority=4294901759 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:29:12 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2019-01-02T00:35:37 EligibleTime=2019-01-02T00:35:37
StartTime=2019-01-02T00:35:37 EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=all AllocNode:Sid=...:30423
ReqNodeList=(null) ExcNodeList=(null)
NodeList=...
BatchHost=...
NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
TRES=cpu=8,mem=20G,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
Nodes=... CPU_IDs=0-7 Mem=20480 GRES_IDX=gpu(IDX:0)
MinCPUsNode=8 MinMemoryNode=20G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=gpu:1 Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/etc/slurm
Power=
GresEnforceBind=Yes
$ scontrol show job=473 --details
JobId=473 JobName=bash
UserId=root(0) GroupId=root(0) MCS_label=N/A
Priority=4294901758 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:30:10 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2019-01-02T00:36:14 EligibleTime=2019-01-02T00:36:14
StartTime=2019-01-02T00:36:14 EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=all AllocNode:Sid=...:31738
ReqNodeList=(null) ExcNodeList=(null)
NodeList=...
BatchHost=...
NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
TRES=cpu=8,mem=20G,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
Nodes=... CPU_IDs=8-15 Mem=20480 GRES_IDX=gpu(IDX:1)
MinCPUsNode=8 MinMemoryNode=20G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=gpu:1 Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/root
Power=
GresEnforceBind=Yes
...
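To avoid pasting every job, the same thing can be seen in one shot; a small sketch:

$ scontrol show job --details | grep -E 'JobId=|GRES_IDX'
# expected: one GRES_IDX line per job, with IDX:0 through IDX:3 spread over the four jobs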
But here is the problem: when I run a Docker container inside the allocation, the Slurm GRES restriction is not applied to it (see the sketch after the container output below).
$ srun --gres=gpu:1 --gres-flags=enforce-binding --cpus-per-task=8 --mem=20G --pty bash
$ nvidia-smi
Wed Jan 2 01:02:00 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:14:00.0 Off | 0 |
| N/A 30C P0 27W / 250W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
$ docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
Wed Jan 2 01:10:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:14:00.0 Off | 0 |
| N/A 30C P0 27W / 250W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... On | 00000000:15:00.0 Off | 0 |
| N/A 32C P0 26W / 250W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-PCIE... On | 00000000:39:00.0 Off | 0 |
| N/A 30C P0 27W / 250W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-PCIE... On | 00000000:3A:00.0 Off | 0 |
| N/A 28C P0 27W / 250W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
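My guess is that the container processes are started by the Docker daemon, outside the job's cgroup, so Slurm's device restriction never applies to them. A workaround I am considering (untested sketch; it assumes CUDA_VISIBLE_DEVICES is exported inside the allocation, and the index Slurm exports may not match the host-global index once the device cgroup is in effect, in which case the GPU UUID would be needed instead):

$ srun --gres=gpu:1 --gres-flags=enforce-binding --cpus-per-task=8 --mem=20G --pty bash
# forward the job's GPU into the container instead of the image default (all GPUs)
$ docker run --runtime=nvidia --rm \
    -e NVIDIA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
    nvidia/cuda nvidia-smi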
Here are my configs (slurm.conf and gres.conf).

slurm.conf:
ControlMachine=...
ControlAddr=...
MailProg=/bin/mail
MpiDefault=none
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmdUser=root
StateSaveLocation=/var/spool
SwitchType=switch/none
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
AuthType=auth/munge
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
GresTypes=gpu
AccountingStorageType=accounting_storage/filetxt
JobCompType=jobcomp/filetxt
JobAcctGatherType=jobacct_gather/cgroup
ClusterName=...
SlurmctldDebug=7
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=7
SlurmdLogFile=/var/log/slurmd.log
# COMPUTE NODES
NodeName=... NodeHostName=... Gres=gpu:4 CPUs=32 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=128432 State=UNKNOWN
PartitionName=all Nodes=... Default=YES MaxTime=INFINITE State=UP

gres.conf:
Name=gpu Type=tesla File=/dev/nvidia0 CPUs=0-7
Name=gpu Type=tesla File=/dev/nvidia1 CPUs=8-15
Name=gpu Type=tesla File=/dev/nvidia2 CPUs=16-23
Name=gpu Type=tesla File=/dev/nvidia3 CPUs=24-31
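(Not pasted above, but since the bare srun sessions do see only one GPU, the task/cgroup device constraint must be active; a minimal cgroup.conf for that would look roughly like this, as a sketch rather than an exact copy of mine:)

# cgroup.conf (sketch)
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes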
What is going wrong here? Is forwarding the allocated GPU into the container (as in the sketch above) the right approach, or is there a supported way to make Docker respect the GPU that Slurm assigned to the job?