[slurm-users] sbatch script won't accept --gres that requires more than 1 gpu
dean.w.schulze at gmail.com
Wed Feb 5 02:00:27 UTC 2020
This started working for me this morning, and I have no idea why. Maybe the multiple restarts of the various daemons did it.
-----Original Message-----
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Brian W. Johanson
Sent: Tuesday, February 4, 2020 1:35 PM
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] sbatch script won't accept --gres that requires more than 1 gpu
Please include the output for:
scontrol show node=liqidos-dean-node1
scontrol show partition=Partition_you_are_attempting_to_submit_to
and
any other #SBATCH lines submitted with the failing job.
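
A minimal sketch of gathering the items requested above in one pass. The function names sbatch_lines and show_slurm_state are made up for illustration, and the batch script path is whatever file was submitted; only the scontrol show node / show partition subcommands come from Slurm itself.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Slurm.

# Print every #SBATCH directive found in a batch script ($1 = script path).
sbatch_lines() {
    grep '^#SBATCH' "$1"
}

# Print node and partition state ($1 = node name, $2 = partition name).
# Returns non-zero if scontrol is not installed on this host.
show_slurm_state() {
    if command -v scontrol >/dev/null 2>&1; then
        scontrol show node "$1"
        scontrol show partition "$2"
    else
        echo "scontrol not found in PATH" >&2
        return 1
    fi
}
```

For example, run show_slurm_state with the node name and the partition being submitted to, then sbatch_lines on the failing script, and paste the combined output into the reply.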
On 2/4/20 9:42 AM, dean.w.schulze at gmail.com wrote:
> I've already restarted slurmctld and slurmd on all nodes. Still get the same problem.
>
> -----Original Message-----
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of
> Marcus Wagner
> Sent: Tuesday, February 4, 2020 2:31 AM
> To: slurm-users at lists.schedmd.com
> Subject: Re: [slurm-users] sbatch script won't accept --gres that
> requires more than 1 gpu
>
> Hi Dean,
>
> could you please try to restart the slurmctld?
>
> This usually helps at our site.
> I have never seen this happen with gres, but I have seen it many other times.
> This is why we restart slurmctld once a day with a cron job.
>
>
> Best
> Marcus
>
> On 2/4/20 12:59 AM, Dean Schulze wrote:
>> When I run an sbatch script with the line
>>
>> #SBATCH --gres=gpu:gp100:1
>>
>> it runs. When I change it to
>>
>> #SBATCH --gres=gpu:gp100:3
>>
>> it fails with "Requested node configuration is not available". But I
>> have a node with 4 gp100s available. Here's my slurm.conf:
>>
>> NodeName=liqidos-dean-node1 CPUs=2 Boards=1 SocketsPerBoard=2 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=3770 Gres=gpu:gp100:4
>>
>> That node has a gres.conf with these lines:
>>
>> Name=gpu Type=gp100 File=/dev/nvidia0
>> Name=gpu Type=gp100 File=/dev/nvidia1
>> Name=gpu Type=gp100 File=/dev/nvidia2
>> Name=gpu Type=gp100 File=/dev/nvidia3
>>
>> The character devices all exist in /dev.
>>
>> What's the controller complaining about?
> --
> Marcus Wagner, Dipl.-Inf.
>
> IT Center
> Abteilung: Systeme und Betrieb
> RWTH Aachen University
> Seffenter Weg 23
> 52074 Aachen
> Tel: +49 241 80-24383
> Fax: +49 241 80-624383
> wagner at itc.rwth-aachen.de
> www.itc.rwth-aachen.de
>
>
>
>