[slurm-users] CUDA environment variable not being set

Relu Patrascu relu at cs.toronto.edu
Thu Oct 8 21:16:22 UTC 2020


Do you have a line like this in your cgroup_allowed_devices_file.conf?

/dev/nvidia*
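
For reference, the stock example of that file in the Slurm cgroups
documentation looks roughly like the sketch below; the /dev/nvidia* line
is the one that matters for GPUs, and the file is only consulted when
cgroup.conf has ConstrainDevices=yes:

/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/dev/nvidia*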

Relu

On 2020-10-08 16:32, Sajesh Singh wrote:
>
> It seems as though the modules are loaded; when I run lsmod I get
> the following:
>
> nvidia_drm             43714  0
> nvidia_modeset       1109636  1 nvidia_drm
> nvidia_uvm            935322  0
> nvidia              20390295  2 nvidia_modeset,nvidia_uvm
>
> Also the nvidia-smi command returns the following:
>
> nvidia-smi
> Thu Oct  8 16:31:57 2020
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |===============================+======================+======================|
> |   0  Quadro M5000        Off  | 00000000:02:00.0 Off |                  Off |
> | 33%   21C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
> |   1  Quadro M5000        Off  | 00000000:82:00.0 Off |                  Off |
> | 30%   17C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID   Type   Process name                             Usage      |
> |=============================================================================|
> |  No running processes found                                                 |
> +-----------------------------------------------------------------------------+
>
> --
>
> -SS-
>
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf
> Of Relu Patrascu
> Sent: Thursday, October 8, 2020 4:26 PM
> To: slurm-users at lists.schedmd.com
> Subject: Re: [slurm-users] CUDA environment variable not being set
>
> That usually means you don't have the nvidia kernel module loaded, 
> probably because there's no driver installed.
>
> Relu
>
> On 2020-10-08 14:57, Sajesh Singh wrote:
>
>     Slurm 18.08
>
>     CentOS 7.7.1908
>
>     I have 2 M5000 GPUs in a compute node which is defined in the
>     slurm.conf and gres.conf of the cluster, but if I launch a job
>     requesting GPUs, the environment variable CUDA_VISIBLE_DEVICES is
>     never set and I see the following messages in the slurmd.log file:
>
>     debug:  common_gres_set_env: unable to set env vars, no device files configured
>
>     Has anyone encountered this before?
>
>     Thank you,
>
>     SS
>
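
For what it's worth, that "no device files configured" debug message
usually indicates that the gres.conf entries for the node have no File=
specifications, so slurmd has no device paths to export. A minimal
sketch for a node with two GPUs (the Type= tag is illustrative and, if
used, must match the Gres= definition in slurm.conf):

Name=gpu Type=m5000 File=/dev/nvidia0
Name=gpu Type=m5000 File=/dev/nvidia1

After updating gres.conf, restart slurmd on the node and check with
something like:

srun --gres=gpu:1 printenv CUDA_VISIBLE_DEVICES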