[slurm-users] [EXT] [Beginner, SLURM 20.11.2] Unable to allocate resources when specifying gres in srun or sbatch
Sean Crosby
scrosby at unimelb.edu.au
Sun Apr 11 05:59:52 UTC 2021
Hi Cristobal,
My hunch is it is due to the default memory/CPU settings.
Does it work if you do
srun --gres=gpu:A100:1 --cpus-per-task=1 --mem=10G nvidia-smi
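If that works, the culprit may be the default memory settings (e.g. DefMemPerCPU / DefMemPerNode) in slurm.conf. For reference, a minimal sketch of the same request as a batch script, reusing the partition name "gpu" and node name "nodeGPU01" from the config you posted:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:A100:1
# explicit CPU and memory request, mirroring the srun line above
#SBATCH --cpus-per-task=1
#SBATCH --mem=10G

nvidia-smi

It is also worth comparing what slurmctld thinks the node offers with what the job asks for:

scontrol show node nodeGPU01 | grep -iE 'cputot|realmemory|gres'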
Sean
--
Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Sun, 11 Apr 2021 at 15:26, Cristóbal Navarro <
cristobal.navarro.g at gmail.com> wrote:
> Hi Community,
> These last two days I've been trying to understand the cause of the
> "Unable to allocate resources" error I keep getting when specifying
> --gres=... in an srun command (or sbatch). It fails with the following error:
> ➜ srun --gres=gpu:A100:1 nvidia-smi
> srun: error: Unable to allocate resources: Requested node configuration is
> not available
>
> Log file on the master node (not the compute one):
> ➜ tail -f /var/log/slurm/slurmctld.log
> [2021-04-11T01:12:23.270] gres:gpu(7696487) type:(null)(0) job:1317 flags: state
> [2021-04-11T01:12:23.270] gres_per_node:1 node_cnt:0
> [2021-04-11T01:12:23.270] ntasks_per_gres:65534
> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
> [2021-04-11T01:12:23.271] _pick_best_nodes: JobId=1317 never runnable in partition gpu
> [2021-04-11T01:12:23.271] _slurm_rpc_allocate_resources: Requested node configuration is not available
>
> If launched without --gres, it allocates all GPUs by default and
> nvidia-smi does work; in fact, our CUDA programs run fine via SLURM as
> long as --gres is not specified.
> ➜ TUT04-GPU-multi git:(master) ✗ srun nvidia-smi
> Sun Apr 11 01:05:47 2021
>
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |                               |                      |               MIG M. |
> |===============================+======================+======================|
> |   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
> | N/A   31C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
> |                               |                      |             Disabled |
> ....
> ....
>
> There is only one compute node, a DGX A100 with 8 GPUs and 2x 64-core CPUs.
> The gres.conf file is simply the following (I also tried the commented-out lines):
> ➜ ~ cat /etc/slurm/gres.conf
> # GRES configuration for native GPUS
> # DGX A100 8x Nvidia A100
> #AutoDetect=nvml
> Name=gpu Type=A100 File=/dev/nvidia[0-7]
>
> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
>
>
> Here are the relevant parts of the slurm.conf file:
> ➜ cat /etc/slurm/slurm.conf
> ...
> ## GRES
> GresTypes=gpu
> AccountingStorageTRES=gres/gpu
> DebugFlags=CPU_Bind,gres
> ...
> ## Nodes list
> ## Default CPU layout, native GPUs
> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:A100:8 Feature=ht,gpu
> ...
> ## Partitions list
> PartitionName=gpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01 Default=YES
> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01
>
> Any ideas on where I should check?
> Thanks in advance.
> --
> Cristóbal A. Navarro
>