[slurm-users] MPS on A100: only the zero-index GPU is used when there are four GPUs
刘文晓
wenxiaoll at 126.com
Tue Feb 22 01:56:01 UTC 2022
Hi there,
I was testing MPS on Slurm 19.05.5 on a compute node with four A100 GPUs. As I understand it, the 400 MPS shares configured on the node should be split evenly across the four GPUs (100 each), so I expected four jobs requesting mps:100 each to run concurrently, one per GPU. Instead, only the first GPU was used, as shown below.
The job script:
#!/bin/bash
#SBATCH -J date
#SBATCH -p NVIDIAA100-PCIE-40GB
#SBATCH -n 1
#SBATCH --gres=mps:100
#SBATCH --mem 1024
#SBATCH -o /home/zren/%j.out
#SBATCH -e /home/zren/%j.out
echo $CUDA_VISIBLE_DEVICES
echo $CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
./vectorAdd
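The same script was submitted four times to produce jobs 290-293, roughly like this (the file name mps_test.sh is just a placeholder for the script above):
# submit four identical copies of the MPS test job
for i in 1 2 3 4; do
    sbatch mps_test.sh
done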
Output of squeue: only one job is running, the other three stay pending.
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
291 NVIDIAA10 date zren PD 0:00 1 (Resources)
292 NVIDIAA10 date zren PD 0:00 1 (Priority)
293 NVIDIAA10 date zren PD 0:00 1 (Priority)
290 NVIDIAA10 date zren R 0:04 1 mig4
Output of nvidia-smi: only the GPU with index 0 is being used.
Tue Feb 22 09:47:45 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... On | 00000000:18:00.0 Off | 0 |
| N/A 33C P0 36W / 250W | 415MiB / 40960MiB | 31% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-PCI... On | 00000000:5E:00.0 Off | 0 |
| N/A 30C P0 33W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-PCI... On | 00000000:AF:00.0 Off | 0 |
| N/A 28C P0 32W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-PCI... On | 00000000:D8:00.0 Off | 0 |
| N/A 30C P0 34W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 10228 C ./vectorAdd 413MiB |
+-----------------------------------------------------------------------------+
The configuration in slurm.conf and gres.conf:
slurm.conf:
NodeName=mig4 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191907 MemSpecLimit=10240 Gres=gpu:4,mps:400 State=UNKNOWN
gres.conf:
AutoDetect=nvml
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia0
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia1
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia2
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia3
Name=mps Count=400
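For what it's worth, the Slurm MPS documentation also shows a gres.conf variant that binds the MPS count to each device file explicitly. I have not tried it on this node yet, but I assume it would look roughly like this:
AutoDetect=nvml
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia0
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia1
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia2
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia3
# 100 MPS shares tied to each GPU device explicitly
Name=mps Count=100 File=/dev/nvidia0
Name=mps Count=100 File=/dev/nvidia1
Name=mps Count=100 File=/dev/nvidia2
Name=mps Count=100 File=/dev/nvidia3
Would that change how the scheduler spreads mps jobs across the GPUs?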
And some slurmctld.log entries for job 291, which is pending with reason (Resources):
[2022-02-22T09:47:12.890] debug3: _pick_best_nodes: JobId=291 idle_nodes 0 share_nodes 1
[2022-02-22T09:47:12.890] debug2: select/cons_tres: select_p_job_test: evaluating JobId=291
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: JobId=291 node_mode:Normal alloc_mode:Run_Now
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: node_list:mig4 exc_cores:NONE
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: nodes: min:1 max:500000 requested:1 avail:1
[2022-02-22T09:47:12.890] select/cons_tres: _job_test: evaluating JobId=291 on 1 nodes
[2022-02-22T09:47:12.890] select/cons_tres: _job_test: test 0 fail: insufficient resources
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: no job_resources info for JobId=291 rc=-1
[2022-02-22T09:47:12.890] debug2: select/cons_tres: select_p_job_test: evaluating JobId=291
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: JobId=291 node_mode:Normal alloc_mode:Test_Only
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: node_list:mig4 exc_cores:NONE
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: nodes: min:1 max:500000 requested:1 avail:1
[2022-02-22T09:47:12.890] select/cons_tres: _job_test: evaluating JobId=291 on 1 nodes
[2022-02-22T09:47:12.890] select/cons_tres: _can_job_run_on_node: 24 CPUs on mig4(state:1), mem 1024/191907
[2022-02-22T09:47:12.890] select/cons_tres: eval_nodes: set:0 consec CPUs:1 nodes:1:mig4 begin:0 end:0 required:-1 weight:511
[2022-02-22T09:47:12.890] select/cons_tres: _job_test: test 0 pass: test_only
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: no job_resources info for JobId=291 rc=0
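If it helps, I can also post the per-GPU GRES usage as seen by the controller; I have been checking it roughly like this:
# show Gres/GresUsed detail for the node
scontrol -d show node mig4
# show the GRES actually allocated to the running job
scontrol -d show job 290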
Thanks