[slurm-users] MPS on A100: only the zero-index GPU is used when there are four GPUs

刘文晓 wenxiaoll at 126.com
Tue Feb 22 01:56:01 UTC 2022


Hi there,


I was testing MPS on Slurm 19.05.5 on a compute node with four A100 GPUs. I expected all four A100s to be used, but I found that only the first GPU was used, as shown below.
The job script:
#!/bin/bash
#SBATCH -J date
#SBATCH -p NVIDIAA100-PCIE-40GB
#SBATCH -n 1
#SBATCH --gres=mps:100
#SBATCH --mem 1024
#SBATCH -o /home/zren/%j.out
#SBATCH -e /home/zren/%j.out


echo $CUDA_VISIBLE_DEVICES
echo $CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
./vectorAdd
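
I submit four copies of this script so that, if the MPS shares were spread across the cards, all four GPUs would be exercised. A minimal sketch of the submission (assuming the script above is saved as mps_test.sh; the name is only for illustration):

# Submit four identical MPS test jobs; the node advertises mps:400,
# so I expected four jobs at --gres=mps:100 each to land on four different GPUs.
for i in 1 2 3 4; do
    sbatch mps_test.sh
done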


Output of squeue; only one job is running:
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               291 NVIDIAA10     date     zren PD       0:00      1 (Resources)
               292 NVIDIAA10     date     zren PD       0:00      1 (Priority)
               293 NVIDIAA10     date     zren PD       0:00      1 (Priority)
               290 NVIDIAA10     date     zren  R       0:04      1 mig4


Output of nvidia-smi; only the GPU at index 0 is used:
Tue Feb 22 09:47:45 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:18:00.0 Off |                    0 |
| N/A   33C    P0    36W / 250W |    415MiB / 40960MiB |     31%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  On   | 00000000:5E:00.0 Off |                    0 |
| N/A   30C    P0    33W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCI...  On   | 00000000:AF:00.0 Off |                    0 |
| N/A   28C    P0    32W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCI...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   30C    P0    34W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+


+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     10228      C   ./vectorAdd                       413MiB |
+-----------------------------------------------------------------------------+
The relevant configuration from slurm.conf and gres.conf:
NodeName=mig4 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191907 MemSpecLimit=10240 Gres=gpu:4,mps:400 State=UNKNOWN


AutoDetect=nvml
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia0
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia1
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia2 
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia3
Name=mps Count=400
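
For reference, the Slurm MPS documentation also shows a per-device form of the MPS entries in gres.conf. I am not sure whether it changes the scheduling behaviour here, but for the four device files above it would look like this (sketch only):

# Per-device variant of the Name=mps line, as in the Slurm MPS docs:
# attach 100 MPS shares to each of the four A100 device files.
Name=mps Count=100 File=/dev/nvidia0
Name=mps Count=100 File=/dev/nvidia1
Name=mps Count=100 File=/dev/nvidia2
Name=mps Count=100 File=/dev/nvidia3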


And some slurmctld.log entries for job 291, which is pending with reason Resources:
[2022-02-22T09:47:12.890] debug3: _pick_best_nodes: JobId=291 idle_nodes 0 share_nodes 1
[2022-02-22T09:47:12.890] debug2: select/cons_tres: select_p_job_test: evaluating JobId=291
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: JobId=291 node_mode:Normal alloc_mode:Run_Now
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: node_list:mig4 exc_cores:NONE
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: nodes: min:1 max:500000 requested:1 avail:1
[2022-02-22T09:47:12.890] select/cons_tres: _job_test: evaluating JobId=291 on 1 nodes
[2022-02-22T09:47:12.890] select/cons_tres: _job_test: test 0 fail: insufficient resources
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: no job_resources info for JobId=291 rc=-1
[2022-02-22T09:47:12.890] debug2: select/cons_tres: select_p_job_test: evaluating JobId=291
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: JobId=291 node_mode:Normal alloc_mode:Test_Only
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: node_list:mig4 exc_cores:NONE
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: nodes: min:1 max:500000 requested:1 avail:1
[2022-02-22T09:47:12.890] select/cons_tres: _job_test: evaluating JobId=291 on 1 nodes
[2022-02-22T09:47:12.890] select/cons_tres: _can_job_run_on_node: 24 CPUs on mig4(state:1), mem 1024/191907
[2022-02-22T09:47:12.890] select/cons_tres: eval_nodes: set:0 consec CPUs:1 nodes:1:mig4 begin:0 end:0 required:-1 weight:511
[2022-02-22T09:47:12.890] select/cons_tres: _job_test: test 0 pass: test_only
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: no job_resources info for JobId=291 rc=0
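
If more detail helps, I can also post what Slurm reports for the allocation itself; these are the standard scontrol commands I would use:

# Detailed job view, including which GRES index was assigned to the running job 290
scontrol -d show job 290
# Detailed node view for mig4, including per-GRES usage
scontrol -d show node mig4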


thanks