[slurm-users] how to locate the problem when slurm failed to restrict gpu usage of user jobs

Brian Andrus toomuchit at gmail.com
Wed Mar 23 14:57:28 UTC 2022


It should exist in the user environment as well.
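For instance, a quick way to confirm that (a minimal sketch; the partition
and gres values are copied from the job script quoted below, and srun is
only used here to print the variable inside an allocation):

    srun -p a100 --gres=gpu:1 bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES'

If that prints a device index but the variable is missing inside the batch
job, something between the allocation and the job shell is dropping or
overriding it.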

I would check the users' .bashrc and .bash_profile settings to see if
they are doing anything that would change that.
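For example (a sketch only; "someuser" is a placeholder for the affected
account, and any site-wide profile scripts may be worth the same check):

    grep -n CUDA_VISIBLE_DEVICES ~someuser/.bashrc ~someuser/.bash_profile

Any line found there that sets, unsets, or re-exports that variable could
explain why the job still sees every card.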

Brian Andrus

On 3/23/2022 7:42 AM, taleintervenor at sjtu.edu.cn wrote:
>
> Hi, all:
>
> We found a problem where Slurm jobs submitted with an argument such as
> --gres gpu:1 are not restricted in their GPU usage: users can still see
> all GPU cards on the allocated nodes.
>
> Our GPU nodes have 4 cards each, with the following gres.conf:
>
> > cat /etc/slurm/gres.conf
>
> Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia0 CPUs=0-15
>
> Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia1 CPUs=16-31
>
> Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia2 CPUs=32-47
>
> Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia3 CPUs=48-63
>
> As a test, we submitted a simple batch job like:
>
> #!/bin/bash
>
> #SBATCH --job-name=test
>
> #SBATCH --partition=a100
>
> #SBATCH --nodes=1
>
> #SBATCH --ntasks=6
>
> #SBATCH --gres=gpu:1
>
> #SBATCH --reservation="gpu test"
>
> hostname
>
> nvidia-smi
>
> echo end
>
> Then, in the output file, nvidia-smi showed all 4 GPU cards, but we
> expected to see only the 1 allocated card.
>
> The official Slurm documentation says it will set the CUDA_VISIBLE_DEVICES
> environment variable to restrict which GPU cards are available to the
> user. But we did not find such a variable in the job environment. We only
> confirmed that it exists in the prolog environment, by adding the debug
> command “echo $CUDA_VISIBLE_DEVICES” to our Slurm prolog script.
>
> So how does Slurm cooperate with the NVIDIA tools to make a job's user see
> only the allocated GPU card? What are the requirements on the NVIDIA GPU
> driver, the CUDA toolkit, or any other component for Slurm to correctly
> restrict GPU usage?
>