[slurm-users] sbatch script won't accept --gres that requires more than 1 gpu

Marcus Wagner wagner at itc.rwth-aachen.de
Wed Feb 5 11:23:03 UTC 2020


I ran into this same issue again today.

> sbatch: error: CPU count per node can not be satisfied
>
> sbatch: error: Batch job submission failed: Requested node 
> configuration is not available

After restarting slurmctld, the user could submit his job with the very 
same jobscript.

One of the oddities of SLURM we have learned to live with.
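For reference, the daily restart we use can be scheduled with a cron entry along these lines (the file path and the systemd service name are assumptions; adjust for your distribution and init system):

```shell
# /etc/cron.d/slurmctld-restart  (hypothetical path)
# Restart slurmctld once a day at 04:00 to work around stale-state oddities
0 4 * * * root /usr/bin/systemctl restart slurmctld
```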

Best
Marcus



On 2/5/20 3:00 AM, dean.w.schulze at gmail.com wrote:
> This started working for me this morning.  I have no idea why it started to work.  Maybe it was multiple restarts of the various daemons that did it.
>
>
> -----Original Message-----
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Brian W. Johanson
> Sent: Tuesday, February 4, 2020 1:35 PM
> To: slurm-users at lists.schedmd.com
> Subject: Re: [slurm-users] sbatch script won't accept --gres that requires more than 1 gpu
>
> Please include the output for:
> scontrol show node=liqidos-dean-node1
> scontrol show partition=Partition_you_are_attempting_to_submit_to
> and
> any other #SBATCH lines submitted with the failing job.
>
>
>
> On 2/4/20 9:42 AM, dean.w.schulze at gmail.com wrote:
>> I've already restarted slurmctld and slurmd on all nodes.  Still get the same problem.
>>
>> -----Original Message-----
>> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of
>> Marcus Wagner
>> Sent: Tuesday, February 4, 2020 2:31 AM
>> To: slurm-users at lists.schedmd.com
>> Subject: Re: [slurm-users] sbatch script won't accept --gres that
>> requires more than 1 gpu
>>
>> Hi Dean,
>>
>> could you please try restarting slurmctld?
>>
>> That usually helps at our site.
>> I have never seen this happen with GRES before, but it has many other times.
>> That is why we restart slurmctld once a day via a cron job.
>>
>>
>> Best
>> Marcus
>>
>> On 2/4/20 12:59 AM, Dean Schulze wrote:
>>> When I run an sbatch script with the line
>>>
>>> #SBATCH --gres=gpu:gp100:1
>>>
>>> it runs.  When I change it to
>>>
>>> #SBATCH --gres=gpu:gp100:3
>>>
>>> it fails with "Requested node configuration is not available".  But I
>>> have a node with 4 gp100s available.  Here's my slurm.conf:
>>>
>>> NodeName=liqidos-dean-node1 CPUs=2 Boards=1 SocketsPerBoard=2 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=3770 Gres=gpu:gp100:4
>>>
>>> That node has a gres.conf with these lines:
>>>
>>> Name=gpu Type=gp100  File=/dev/nvidia0
>>> Name=gpu Type=gp100  File=/dev/nvidia1
>>> Name=gpu Type=gp100  File=/dev/nvidia2
>>> Name=gpu Type=gp100  File=/dev/nvidia3
>>>
>>> The character devices all exist in /dev.
>>>
>>> What's the controller complaining about?
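A quick way to compare what the controller has actually registered against what the job asks for (standard Slurm commands; the node name and job script name are from the thread and assumed to match your setup):

```shell
# Show the GRES and CPU configuration the controller holds for the node
scontrol show node liqidos-dean-node1 | grep -Ei 'gres|cpu'

# Validate the job request without actually submitting it;
# --test-only reports whether the requested node configuration is satisfiable
sbatch --test-only --gres=gpu:gp100:3 job.sh
```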
>> --
>> Marcus Wagner, Dipl.-Inf.
>>
>> IT Center
>> Abteilung: Systeme und Betrieb
>> RWTH Aachen University
>> Seffenter Weg 23
>> 52074 Aachen
>> Tel: +49 241 80-24383
>> Fax: +49 241 80-624383
>> wagner at itc.rwth-aachen.de
>> www.itc.rwth-aachen.de
>>
>>
>>
>>
>
>
>

-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de


