[slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected

Renfro, Michael Renfro at tntech.edu
Wed Nov 13 16:17:45 UTC 2019


Pretty sure you don’t need to explicitly specify GPU IDs on a Gromacs job running inside of Slurm with gres=gpu. Gromacs should only see the GPUs you have reserved for that job.
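For comparison, here’s a minimal sketch of the sort of batch script I’d expect to work; the job name and partition name are placeholders for your site, and the mdrun line just mirrors yours with -gpu_id dropped:

=====

#!/bin/bash
#SBATCH --job-name=gmx-gpu       # placeholder job name
#SBATCH --partition=gpu          # placeholder partition name
#SBATCH --gres=gpu:1             # reserve one GPU; Slurm sets CUDA_VISIBLE_DEVICES
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# No -gpu_id needed: the CUDA runtime only exposes the GPU(s) Slurm assigned,
# so Gromacs numbers whatever it sees starting from 0.
gmx mdrun -v -pin on -deffnm equi_nvt -nt 8 -nb gpu -pme gpu -npme 1 -ntmpi 4

=====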

Here’s a small verification program you can run to confirm that two simultaneous GPU jobs see different GPU devices (compile with nvcc):

=====

// From http://www.cs.fsu.edu/~xyuan/cda5125/examples/lect24/devicequery.cu
#include <stdio.h>
#include <cuda_runtime.h>   // explicit include; nvcc adds this automatically for .cu files

void printDevProp(cudaDeviceProp dP)
{
    printf("%s has %d multiprocessors\n", dP.name, dP.multiProcessorCount);
    printf("%s has PCI BusID %d, DeviceID %d\n", dP.name, dP.pciBusID, dP.pciDeviceID);
}

int main()
{
    // Number of CUDA devices visible to this process
    int devCount;
    cudaGetDeviceCount(&devCount);
    printf("There are %d CUDA devices.\n", devCount);

    // Iterate through the visible devices
    for (int i = 0; i < devCount; ++i)
    {
        // Get and print this device's properties
        printf("CUDA Device #%d: ", i);
        cudaDeviceProp devProp;
        cudaGetDeviceProperties(&devProp, i);
        printDevProp(devProp);
    }
    return 0;
}

=====
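
For reference, one way to build and run it under a GPU reservation; the source filename and partition name are placeholders, not from my cluster:

=====

nvcc -o cuda_props devicequery.cu
srun --partition=gpu --gres=gpu:1 ./cuda_props

=====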

When run from two simultaneous jobs on the same node (each with a gres=gpu), I get:

=====

[renfro at gpunode003(job 221584) hw]$ ./cuda_props
There are 1 CUDA devices.
CUDA Device #0: Tesla K80 has 13 multiprocessors
Tesla K80 has PCI BusID 5, DeviceID 0

=====

[renfro at gpunode003(job 221585) hw]$ ./cuda_props
There are 1 CUDA devices.
CUDA Device #0: Tesla K80 has 13 multiprocessors
Tesla K80 has PCI BusID 6, DeviceID 0

=====

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601     / Tennessee Tech University

> On Nov 13, 2019, at 9:54 AM, Tamas Hegedus <tamas at hegelab.org> wrote:
> 
> Hi,
> 
> I run gmx 2019 using GPUs.
> There are 4 GPUs in my GPU hosts.
> I have Slurm and have configured gres=gpu.
> 
> 1. If I submit a job with --gres=gpu:1, then GPU #0 is identified and used
> (-gpu_id $CUDA_VISIBLE_DEVICES).
> 2. If I submit a second job, it fails: $CUDA_VISIBLE_DEVICES is 1 and is
> passed to -gpu_id, but gmx identifies only GPU #0 as a compatible GPU.
> From the output:
> 
> gmx mdrun -v -pin on -deffnm equi_nvt -nt 8 -gpu_id 1 -nb gpu -pme gpu
> -npme 1 -ntmpi 4
> 
>  GPU info:
>    Number of GPUs detected: 1
>    #0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC:  no, stat:
> compatible
> 
> Fatal error:
> You limited the set of compatible GPUs to a set that included ID #1, but that
> ID is not for a compatible GPU. List only compatible GPUs.
> 
> 3. If I log in to that node and run the mdrun command from the output of
> the previous step, it selects the right GPU and runs as expected.
> 
> $CUDA_DEVICE_ORDER is set to PCI_BUS_ID
> 
> I cannot decide whether this is a Slurm config error or something with
> gromacs, as $CUDA_VISIBLE_DEVICES is set correctly by Slurm and I expect
> gromacs to detect all 4 GPUs.
> 
> Thanks for your help and suggestions,
> Tamas
> 
> --
> 
> Tamas Hegedus, PhD
> Senior Research Fellow
> Department of Biophysics and Radiation Biology
> Semmelweis University     | phone: (36) 1-459 1500/60233
> Tuzolto utca 37-47        | mailto:tamas at hegelab.org
> Budapest, 1094, Hungary   | http://www.hegelab.org
> 
> 


