[slurm-users] Invalid device ordinal

Henderson, Cornelius J. (GSFC-606.2)[InuTeq, LLC] cornelius.j.henderson at nasa.gov
Fri May 12 16:40:54 UTC 2023

Hello -

I'm trying to get GPU container jobs working on virtual nodes. The jobs fail with "Test CUDA failure common.cu:893 'invalid device ordinal'" in the output file and "slurmstepd: error:  mpi/pmix_v3: _errhandler: n4 [0]: pmixp_client_v2.c:211: Error handler invoked: status = -25, source = [slurm.pmix.126.0:1]" in the error file. Searching turns up cases where people selected the wrong GPU or requested too many GPUs, but I'm only trying to use one GPU (per node).

Some info:

  *   slurm-22.05
     *   slurm-22.05.5-1.el9.x86_64
     *   slurm-contribs-22.05.5-1.el9.x86_64
     *   slurm-devel-22.05.5-1.el9.x86_64
     *   slurm-libpmi-22.05.5-1.el9.x86_64
     *   slurm-pam_slurm-22.05.5-1.el9.x86_64
     *   slurm-perlapi-22.05.5-1.el9.x86_64
     *   slurm-slurmctld-22.05.5-1.el9.x86_64
     *   slurm-example-configs-22.05.5-1.el9.x86_64
     *   nvslurm-plugin-pyxis-0.14.0-1.el9.x86_64
  *   Rocky Linux release 9.0 (Blue Onyx)
  *   KVM virtualization
  *   Six-node cluster, n0 - n5; n4 and n5 each have one Tesla V100-SXM2-16GB.
  *   Driver Version: 530.30.02

My attempt at setting this up:

  *   Configure GresTypes=gpu in slurm.conf
  *   Give n4 and n5 their own NodeName entry in slurm.conf that uses the GresType
     *   NodeName=n[4-5] GRES=gpu:1 CPUs=3 State=UNKNOWN
  *   Create /etc/slurm/gres.conf on each gpu node
     *   Name=gpu File=/dev/nvidia0
  *   Sync slurm.conf across the cluster and restart slurmd on n[1-5]
  *   Restart slurmctld on n0
  *   Resume n4 and n5
     *   scontrol update nodename=n[4-5] state=resume
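
Pulled together, the relevant fragments from the steps above look like this (reconstructed from my notes, so placement within the files is approximate):

# slurm.conf (identical copy on every node)
GresTypes=gpu
NodeName=n[4-5] GRES=gpu:1 CPUs=3 State=UNKNOWN

# /etc/slurm/gres.conf (on n4 and n5 only)
Name=gpu File=/dev/nvidia0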

References: https://slurm.schedmd.com/gres.html, https://slurm.schedmd.com/gres.conf.html

This little test script works and gives me GPU info:

#!/bin/bash
#SBATCH -J gpu_test
#SBATCH -n 3
#SBATCH -w n5
#SBATCH -o %j.o
#SBATCH -e %j.e

nvidia-debugdump -l

This script fails with the errors I mentioned above:

#!/bin/bash
#SBATCH -J tfmpi
#SBATCH -n 6
#SBATCH -w n[4-5]
#SBATCH -o %j.o
#SBATCH -e %j.e
#SBATCH --gres=gpu:1
#SBATCH --gpus=1

srun --mpi=pmix --container-image=nvcr.io#nvidia/tensorflow:23.02-tf2-py3 all_reduce_perf_mpi -b 1G -e 1G -c 1
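
One sanity check I've been considering (a hypothetical command, not output from the failing run) is to launch the same allocation without the container and print what each task actually sees before NCCL gets involved:

# Hypothetical debug step: same allocation shape as the failing job,
# but each task just reports its host, rank, and visible GPU(s).
srun --mpi=pmix -w n[4-5] -n 6 \
    bash -c 'echo "$(hostname) rank=$SLURM_PROCID CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"'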

What am I missing to get the second script to run?

Thank you.
Cornelius Henderson
Senior Systems Administrator
NASA Center for Climate Simulation (NCCS)
ASRC Federal InuTeq, LLC
Goddard Space Flight Center
Greenbelt, MD 20771
