[slurm-users] [EXT] [Beginner, SLURM 20.11.2] Unable to allocate resources when specifying gres in srun or sbatch

Cristóbal Navarro cristobal.navarro.g at gmail.com
Sun Apr 11 14:18:43 UTC 2021


Hi Sean,
I tried as suggested but I'm still getting the same error.
Just in case, here is the node configuration as reported by 'scontrol':
➜  scontrol show node
NodeName=nodeGPU01 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=0 CPUTot=256 CPULoad=8.07
   AvailableFeatures=ht,gpu
   ActiveFeatures=ht,gpu
   Gres=gpu:A100:8
   NodeAddr=nodeGPU01 NodeHostName=nodeGPU01 Version=20.11.2
   OS=Linux 5.4.0-66-generic #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC 2021
   RealMemory=1024000 AllocMem=0 FreeMem=1019774 Sockets=8 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpu,cpu
   BootTime=2021-04-09T21:23:14 SlurmdStartTime=2021-04-11T10:11:12
   CfgTRES=cpu=256,mem=1000G,billing=256
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)
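
For what it's worth, my understanding is that on a node whose GRES has been registered with the controller, CfgTRES normally also lists a gres/gpu count alongside cpu and mem. A quick check (the output below is only a sketch of what I would hope to see):
➜  scontrol show node nodeGPU01 | grep -E 'Gres=|CfgTRES='
   Gres=gpu:A100:8
   CfgTRES=cpu=256,mem=1000G,billing=256,gres/gpu=8
The CfgTRES line above, with no gres/gpu entry at all, may itself be a hint.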




On Sun, Apr 11, 2021 at 2:03 AM Sean Crosby <scrosby at unimelb.edu.au> wrote:

> Hi Cristobal,
>
> My hunch is it is due to the default memory/CPU settings.
>
> Does it work if you do
>
> srun --gres=gpu:A100:1 --cpus-per-task=1 --mem=10G nvidia-smi
>
> Sean
> --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
>
> On Sun, 11 Apr 2021 at 15:26, Cristóbal Navarro <
> cristobal.navarro.g at gmail.com> wrote:
>
>> Hi Community,
>> For the last two days I've been trying to understand the cause of the
>> "Unable to allocate resources" error I keep getting whenever I specify
>> --gres=... in an srun (or sbatch) command. It fails with:
>> ➜  srun --gres=gpu:A100:1 nvidia-smi
>> srun: error: Unable to allocate resources: Requested node configuration is not available
>>
>> Log file on the master node (not the compute node):
>> ➜  tail -f /var/log/slurm/slurmctld.log
>> [2021-04-11T01:12:23.270] gres:gpu(7696487) type:(null)(0) job:1317 flags: state
>> [2021-04-11T01:12:23.270]   gres_per_node:1 node_cnt:0
>> [2021-04-11T01:12:23.270]   ntasks_per_gres:65534
>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
>> [2021-04-11T01:12:23.271] _pick_best_nodes: JobId=1317 never runnable in partition gpu
>> [2021-04-11T01:12:23.271] _slurm_rpc_allocate_resources: Requested node configuration is not available
>>
>> If launched without --gres, the job gets all GPUs by default and
>> nvidia-smi works; in fact, our CUDA programs run fine through Slurm as
>> long as --gres is not specified.
>> ➜  TUT04-GPU-multi git:(master) ✗ srun nvidia-smi
>> Sun Apr 11 01:05:47 2021
>>
>> +-----------------------------------------------------------------------------+
>> | NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
>> |-------------------------------+----------------------+----------------------+
>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>> |                               |                      |               MIG M. |
>> |===============================+======================+======================|
>> |   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
>> | N/A   31C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
>> |                               |                      |             Disabled |
>> ....
>> ....
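>>
>> (As an aside, and only as a sketch of what I would expect once a --gres
>> request is accepted: slurmd normally exports CUDA_VISIBLE_DEVICES for
>> gres/gpu jobs, so something like
>> ➜  srun --gres=gpu:A100:1 bash -c 'echo $CUDA_VISIBLE_DEVICES'
>> should print a single device index, e.g. 0.)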
>>
>> There is only one compute node, a DGX A100 with 8 GPUs and 2x 64-core CPUs,
>> and the gres.conf file is simply the following (I also tried the commented-out lines):
>> ➜  ~ cat /etc/slurm/gres.conf
>> # GRES configuration for native GPUS
>> # DGX A100 8x Nvidia A100
>> #AutoDetect=nvml
>> Name=gpu Type=A100 File=/dev/nvidia[0-7]
>>
>> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
>> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
>> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
>> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
>> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
>> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
>> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
>> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
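>>
>> (As a sanity check directly on the compute node, my understanding is that
>> 'slurmd -G' prints the GRES configuration slurmd itself has parsed, so
>> something like
>> ➜  slurmd -G
>> should list the eight gpu:A100 devices if this gres.conf is being read.)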
>>
>>
>> Some relevant parts of the slurm.conf file:
>> ➜  cat /etc/slurm/slurm.conf
>> ...
>> ## GRES
>> GresTypes=gpu
>> AccountingStorageTRES=gres/gpu
>> DebugFlags=CPU_Bind,gres
>> ...
>> ## Nodes list
>> ## Default CPU layout, native GPUs
>> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:A100:8 Feature=ht,gpu
>> ...
>> ## Partitions list
>> PartitionName=gpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01 Default=YES
>> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01
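>>
>> (In case default memory/CPU settings turn out to matter: the partitions set
>> no default memory at all, so one thing I could try, purely as a sketch with
>> an illustrative value, is a per-partition default such as
>> PartitionName=gpu OverSubscribe=FORCE MaxCPUsPerNode=128 DefMemPerCPU=4000 MaxTime=INFINITE State=UP Nodes=nodeGPU01 Default=YES
>> )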
>>
>> Any ideas on where I should look?
>> Thanks in advance.
>> --
>> Cristóbal A. Navarro
>>
>

-- 
Cristóbal A. Navarro