[slurm-users] sbatch script won't accept --gres that requires more than 1 gpu

Tue Feb 4 14:42:06 UTC 2020

I've already restarted slurmctld and slurmd on all nodes.  I still get the same problem.

-----Original Message-----
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Marcus Wagner
Sent: Tuesday, February 4, 2020 2:31 AM
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] sbatch script won't accept --gres that requires more than 1 gpu

Hi Dean,

Could you please try restarting slurmctld?

This usually helps at our site.
I have never seen this happen with gres, but I have seen it many other times.
This is why we restart slurmctld once a day via a cron job.

Best
Marcus

On 2/4/20 12:59 AM, Dean Schulze wrote:
> When I run an sbatch script with the line
>
> #SBATCH --gres=gpu:gp100:1
>
> it runs.  When I change it to
>
> #SBATCH --gres=gpu:gp100:3
>
> it fails with "Requested node configuration is not available".  But I 
> have a node with 4 gp100s available.  Here's my slurm.conf:
>
> NodeName=liqidos-dean-node1 CPUs=2 Boards=1 SocketsPerBoard=2 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=3770 Gres=gpu:gp100:4
>
> That node has a gres.conf with these lines:
>
> Name=gpu Type=gp100  File=/dev/nvidia0
> Name=gpu Type=gp100  File=/dev/nvidia1
> Name=gpu Type=gp100  File=/dev/nvidia2
> Name=gpu Type=gp100  File=/dev/nvidia3
>
> The character devices all exist in /dev.
>
> What's the controller complaining about?
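
One way to narrow down a mismatch like the one described above is to compare the GPU count that slurm.conf declares (4 here) with what the controller actually reports for the node. The sketch below is only an illustration: the helper name gres_count is hypothetical, and it assumes that `scontrol show node` prints a `Gres=` field of the form `gpu:<type>:<count>`, which may differ between Slurm versions.

```shell
# Hypothetical helper: sum the GPUs of one type found in the Gres= field
# of `scontrol show node` output read from stdin.
#   $1 = GPU type, e.g. gp100
gres_count() {
  # Keep only the value of the Gres= field (up to the next space) ...
  sed -n 's/.*Gres=\([^ ]*\).*/\1/p' |
    # ... split comma-separated gres entries onto their own lines ...
    tr ',' '\n' |
    # ... and add up the counts of gpu entries with the requested type.
    awk -F: -v t="$1" '$1 == "gpu" && $2 == t { n += $3 } END { print n + 0 }'
}

# Intended use on a machine with the Slurm client tools installed:
#   scontrol show node liqidos-dean-node1 | gres_count gp100
```

If the printed number is lower than the count requested with --gres, the controller's view of the node does not match slurm.conf, which would be consistent with a "Requested node configuration is not available" rejection. If the number matches, the cause is more likely elsewhere in the job request.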
