[slurm-users] how to locate the problem when slurm failed to restrict gpu usage of user jobs

John Hanks griznog at gmail.com
Wed Mar 23 14:56:08 UTC 2022


Do you have a matching Gres=gpu:4 or similar in your node config lines? I'm
not sure if that is still required, but we have it in our config which does
work to isolate GPUs to jobs they are assigned to.
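
For reference, here is a minimal sketch of the pieces that usually have to
line up. The node name gpu01 is hypothetical, CPUs=64 is inferred from the
CPUs=0-63 ranges in the gres.conf quoted below, and the cgroup-based plugins
are an assumption about the site setup, not something taken from the
poster's config:

```
# slurm.conf (fragment)
GresTypes=gpu
# gpu01 and CPUs=64 are placeholders for the real node definition
NodeName=gpu01 CPUs=64 Gres=gpu:NVlink_A100_40GB:4
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf (fragment)
ConstrainDevices=yes
```

Note that nvidia-smi does not honor CUDA_VISIBLE_DEVICES, so it should only
show a reduced set of cards when device access is constrained through
cgroups. The environment variable by itself affects CUDA applications, not
what nvidia-smi lists.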

griznog

On Wed, Mar 23, 2022 at 9:45 AM <taleintervenor at sjtu.edu.cn> wrote:

> Hi, all:
>
> We found that a Slurm job submitted with an argument such as --gres gpu:1
> is not restricted in its GPU usage: the user can still see all GPU cards on
> the allocated node.
>
> Each of our GPU nodes has 4 cards, and its gres.conf is:
>
> $ cat /etc/slurm/gres.conf
> Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia0 CPUs=0-15
> Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia1 CPUs=16-31
> Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia2 CPUs=32-47
> Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia3 CPUs=48-63
>
> For testing, we submit a simple batch job like:
>
> #!/bin/bash
> #SBATCH --job-name=test
> #SBATCH --partition=a100
> #SBATCH --nodes=1
> #SBATCH --ntasks=6
> #SBATCH --gres=gpu:1
> #SBATCH --reservation="gpu test"
> hostname
> nvidia-smi
> echo end
>
> In the output file, nvidia-smi then showed all 4 GPU cards, but we expected
> to see only the 1 allocated card.
>
> The official Slurm documentation says that it sets the CUDA_VISIBLE_DEVICES
> environment variable to restrict the GPU cards available to the user, but
> we did not find that variable in the job environment. We only confirmed
> that it exists in the prolog script environment, by adding the debug
> command "echo $CUDA_VISIBLE_DEVICES" to the Slurm prolog script.
>
> So how does Slurm cooperate with the NVIDIA tools so that a job's user sees
> only the allocated GPU card? What are the requirements on the NVIDIA GPU
> driver, the CUDA toolkit, or any other component for Slurm to restrict GPU
> usage correctly?
>