[slurm-users] CUDA environment variable not being set

Sajesh Singh ssingh at amnh.org
Thu Oct 8 20:57:26 UTC 2020


I only get a line returned for “Gres=”; nothing matches for “CfgTRES=”. But this is the same behavior on another cluster of ours that has GPUs, and the variable does get set on that cluster.
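
(In case it is relevant: as far as I know, whether gres/gpu appears in the “CfgTRES=” line is controlled by AccountingStorageTRES in slurm.conf, e.g. the line below. That is an assumption about our setup, and it should not affect CUDA_VISIBLE_DEVICES either way.)

AccountingStorageTRES=gres/gpu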

-Sajesh-

--
_____________________________________________________
Sajesh Singh
Manager, Systems and Scientific Computing
American Museum of Natural History
200 Central Park West
New York, NY 10024

(O) (212) 313-7263
(C) (917) 763-9038
(E) ssingh at amnh.org

From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Renfro, Michael
Sent: Thursday, October 8, 2020 4:53 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] CUDA environment variable not being set

From any node you can run scontrol from, what does ‘scontrol show node GPUNODENAME | grep -i gres’ return? Mine returns lines for both “Gres=” and “CfgTRES=”.
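
For comparison, on a node where GRES is fully configured, the two lines look roughly like this (values illustrative, not taken from any particular node):

Gres=gpu:2
CfgTRES=cpu=28,mem=192000M,billing=28,gres/gpu=2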

From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Sajesh Singh <ssingh at amnh.org>
Reply-To: Slurm User Community List <slurm-users at lists.schedmd.com>
Date: Thursday, October 8, 2020 at 3:33 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] CUDA environment variable not being set


It seems the modules are loaded; when I run lsmod I get the following:

nvidia_drm             43714  0
nvidia_modeset       1109636  1 nvidia_drm
nvidia_uvm            935322  0
nvidia              20390295  2 nvidia_modeset,nvidia_uvm

Also, the nvidia-smi command returns the following:

nvidia-smi
Thu Oct  8 16:31:57 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:02:00.0 Off |                  Off |
| 33%   21C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M5000        Off  | 00000000:82:00.0 Off |                  Off |
| 30%   17C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

--

-SS-

From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Relu Patrascu
Sent: Thursday, October 8, 2020 4:26 PM
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] CUDA environment variable not being set

That usually means you don't have the nvidia kernel module loaded, probably because there's no driver installed.
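
A quick way to verify both (a sketch, assuming the standard NVIDIA device paths):

# Is the kernel module loaded?
lsmod | grep nvidia

# Were the device files created? Expect /dev/nvidia0, /dev/nvidia1, /dev/nvidiactl, etc.
ls -l /dev/nvidia*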

Relu
On 2020-10-08 14:57, Sajesh Singh wrote:
Slurm 18.08
CentOS 7.7.1908

I have two M5000 GPUs in a compute node that is defined in the slurm.conf and gres.conf of the cluster, but if I launch a job requesting GPUs, the environment variable CUDA_VISIBLE_DEVICES is never set and I see the following message in the slurmd.log file:

debug:  common_gres_set_env: unable to set env vars, no device files configured
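
In case it helps, my understanding is that this message appears when the GRES entries do not name the GPU device files. A gres.conf that maps each GPU to its device explicitly would look roughly like this (node name and device paths are placeholders; a sketch rather than a known-good config):

NodeName=gpunode01 Name=gpu Type=m5000 File=/dev/nvidia0
NodeName=gpunode01 Name=gpu Type=m5000 File=/dev/nvidia1

with the matching pieces in slurm.conf:

GresTypes=gpu
NodeName=gpunode01 Gres=gpu:m5000:2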

Has anyone encountered this before?

Thank you,

SS