[slurm-users] [EXT] [Beginner, SLURM 20.11.2] Unable to allocate resources when specifying gres in srun or sbatch

Cristóbal Navarro cristobal.navarro.g at gmail.com
Thu May 13 21:14:17 UTC 2021


Hi Sean and Community,
Some days ago I changed to the cons_tres plugin and also got AutoDetect=nvml
working for gres.conf (attached at the end of the email); the node and
partition definitions seem to be OK (attached at the end as well).
I believe the SLURM setup is just a few steps away from being properly set
up. Currently I have two very basic scenarios that are giving me
questions/problems:

*For 1) Running GPU jobs without containers*:
I was expecting that when running, for example, "srun -p gpu --gres=gpu:A100:1
nvidia-smi -L", the output would list just 1 GPU. However, that is not the case.
➜  TUT03-GPU-single srun -p gpu --gres=gpu:A100:1 nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-baa4736e-088f-77ce-0290-ba745327ca95)
GPU 1: A100-SXM4-40GB (UUID: GPU-d40a3b1b-006b-37de-8b72-669c59d14954)
GPU 2: A100-SXM4-40GB (UUID: GPU-35a012ac-2b34-b68f-d922-24aa07af1be6)
GPU 3: A100-SXM4-40GB (UUID: GPU-b75a4bf8-123b-a8c0-dc75-7709626ead20)
GPU 4: A100-SXM4-40GB (UUID: GPU-9366ff9f-a20a-004e-36eb-8376655b1419)
GPU 5: A100-SXM4-40GB (UUID: GPU-75da7cd5-daf3-10fd-2c3f-56259c1dc777)
GPU 6: A100-SXM4-40GB (UUID: GPU-f999e415-54e5-9d7f-0c4b-1d4d98a1dbfc)
GPU 7: A100-SXM4-40GB (UUID: GPU-cce4a787-1b22-bed7-1e93-612906567a0e)

Still, when opening an interactive session, it really does provide only 1 GPU.
➜  TUT03-GPU-single srun -p gpu --gres=gpu:A100:1 --pty bash
user at nodeGPU01:$ echo $CUDA_VISIBLE_DEVICES
2

Moreover, I tried running simultaneous jobs, each one with
--gres=gpu:A100:1 and the source code logically choosing GPU ID 0, and
indeed different physical GPUs get used, which is great. My only concern
here for *1)* is that the list always displays all of the devices. It
could confuse users, making them think they have all those GPUs at their
disposal and leading them to make wrong decisions. Nevertheless, this issue
is not critical compared to the next one.
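
My guess here (just an assumption on my part, nothing I have tested yet) is
that this is related to cgroup device constraining rather than to gres
itself, since TaskPlugin=task/cgroup is already set in slurm.conf. A minimal
cgroup.conf sketch of what I understand should hide the unallocated GPUs
from a job step would be:

➜  ~ cat /etc/slurm/cgroup.conf
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
# hide /dev/nvidia* devices that are not allocated to the job
ConstrainDevices=yes

If that is the wrong direction, corrections are very welcome.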

*2) Running GPU jobs with containers (pyxis + enroot)*
For this case, the list of GPUs does get reduced to the number of devices
selected with gres; however, there seems to be a problem with how GPU IDs
are referenced from inside the container and mapped to the physical GPUs,
which produces a CUDA runtime error.

Running nvidia-smi -L gives
➜  TUT03-GPU-single srun -p gpu --container-name=cuda-11.2.2
--container-image=cuda-11.2.2 --pty --gres=gpu:A100:1 nvidia-smi -L

GPU 0: A100-SXM4-40GB (UUID: GPU-35a012ac-2b34-b68f-d922-24aa07af1be6)
As we can see, physical GPU 2 is allocated (we can verify this with the UUID).
From what I understand of the idea behind SLURM, the programmer does not need
to know that this is GPU ID 2; he/she can just develop the program assuming
GPU ID 0, because there is only 1 GPU allocated. That is how it worked in
case 1); otherwise one could not know which GPU ID is the available one.

Now, if I launch a job with --gres=gpu:A100:1, something like a CUDA matrix
multiplication with some NVML info printed, I get
➜  TUT03-GPU-single srun -p gpu --container-name=cuda-11.2.2
--container-image=cuda-11.2.2 --pty --gres=gpu:A100:1 ./prog 0 $((1024*40))
1
Driver version: 450.102.04
NUM GPUS = 1
Listing devices:
GPU0 A100-SXM4-40GB, index=0, UUID=GPU-35a012ac-2b34-b68f-d922-24aa07af1be6
 -> util = 0%
Choosing GPU 0
GPUassert: no CUDA-capable device is detected main.cu 112
srun: error: nodeGPU01: task 0: Exited with exit code 100

the "index=.." is the GPU index given by nvml.
Now If I do --gres=gpu:A100:3,  the real first GPU gets allocated, and the
program works, but It is not the way in which SLURM should work.
➜  TUT03-GPU-single srun -p gpu --container-name=cuda-11.2.2
--container-image=cuda-11.2.2 --pty --gres=gpu:A100:3 ./prog 0 $((1024*40))
1
Driver version: 450.102.04
NUM GPUS = 3
Listing devices:
GPU0 A100-SXM4-40GB, index=0, UUID=GPU-baa4736e-088f-77ce-0290-ba745327ca95
 -> util = 0%
GPU1 A100-SXM4-40GB, index=1, UUID=GPU-35a012ac-2b34-b68f-d922-24aa07af1be6
 -> util = 0%
GPU2 A100-SXM4-40GB, index=2, UUID=GPU-b75a4bf8-123b-a8c0-dc75-7709626ead20
 -> util = 0%
Choosing GPU 0
initializing A and B.......done
matmul shared mem..........done: time: 26.546274 secs
copying result to host.....done
verifying result...........done

I find it very strange that, when using containers, GPU 0 from inside the
job seems to be accessing the real physical GPU 0 of the machine, and not
the GPU 0 provided by SLURM as in case 1), which worked well.
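
To narrow this down, what I plan to check (again a guess on my side, nothing
verified) is how the visibility environment variables look inside versus
outside the container. As far as I understand, pyxis/enroot expose GPUs to
the container through NVIDIA_VISIBLE_DEVICES, while SLURM sets
CUDA_VISIBLE_DEVICES to the host index of the allocated GPU (2 in the
example above), so the two mechanisms could be disagreeing:

# bare job step, no container
srun -p gpu --gres=gpu:A100:1 \
    bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES; nvidia-smi -L'

# same request, inside the container
srun -p gpu --gres=gpu:A100:1 --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 \
    bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES NVIDIA_VISIBLE_DEVICES=$NVIDIA_VISIBLE_DEVICES; nvidia-smi -L'

If CUDA_VISIBLE_DEVICES still contains the host index 2 inside a container
that only exposes a single device (renumbered as 0), that would at least
explain the "no CUDA-capable device is detected" error.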

If anyone has advice on where to look for either of these two issues, I would
really appreciate it.
Many thanks in advance, and sorry for the long email.
-- Cristobal


---------------------
CONFIG FILES
*# gres.conf*
➜  ~ cat /etc/slurm/gres.conf
AutoDetect=nvml
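
For reference, this single AutoDetect=nvml line replaces the explicit
per-device configuration I was using before (quoted further down in this
thread); a sketch of that explicit form, in case it matters:

# explicit alternative I had tried previously
Name=gpu Type=A100 File=/dev/nvidia[0-7]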



*# slurm.conf*

*....*
## Basic scheduling
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
SchedulerType=sched/backfill

## Accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
AccountingStorageHost=10.10.0.1

TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

## scripts
Epilog=/etc/slurm/epilog
Prolog=/etc/slurm/prolog
PrologFlags=Alloc

## Nodes list
NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1
RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8
Feature=gpu

## Partitions list
PartitionName=gpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=65556
DefCpuPerGPU=8 DefMemPerGPU=65556 MaxMemPerNode=532000 MaxTime=1-00:00:00
State=UP Nodes=nodeGPU01  Default=YES
PartitionName=cpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=16384
MaxMemPerNode=420000 MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01

On Tue, Apr 13, 2021 at 9:38 PM Cristóbal Navarro <
cristobal.navarro.g at gmail.com> wrote:

> Hi Sean,
> Sorry for the delay,
> The problem got solved accidentally by restarting the slurm services on
> the head node.
> Maybe it was an unfortunate combination of changes that I assumed
> "scontrol reconfigure" would apply properly.
>
> Anyway, I will follow your advice and try changing to the "cons_tres"
> plugin.
> Will post back with the result.
> best and many thanks
>
> On Mon, Apr 12, 2021 at 6:35 AM Sean Crosby <scrosby at unimelb.edu.au>
> wrote:
>
>> Hi Cristobal,
>>
>> The weird stuff I see in your job is
>>
>> [2021-04-11T01:12:23.270] gres:gpu(7696487) type:(null)(0) job:1317
>> flags: state
>> [2021-04-11T01:12:23.270]   gres_per_node:1 node_cnt:0
>> [2021-04-11T01:12:23.270]   ntasks_per_gres:65534
>>
>> Not sure why ntasks_per_gres is 65534 and node_cnt is 0.
>>
>> Can you try
>>
>> srun --gres=gpu:A100:1 --mem=10G --cpus-per-gpu=1 --nodes=1 nvidia-smi
>>
>> and post the output of slurmctld.log?
>>
>> I also recommend changing from cons_res to cons_tres for SelectType
>>
>> e.g.
>>
>> SelectType=select/cons_tres
>> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
>>
>> Sean
>>
>> --
>> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
>> Research Computing Services | Business Services
>> The University of Melbourne, Victoria 3010 Australia
>>
>>
>>
>> On Mon, 12 Apr 2021 at 00:18, Cristóbal Navarro <
>> cristobal.navarro.g at gmail.com> wrote:
>>
>>> Hi Sean,
>>> Tried as suggested but still getting the same error.
>>> This is the node configuration visible to 'scontrol' just in case
>>> ➜  scontrol show node
>>> NodeName=nodeGPU01 Arch=x86_64 CoresPerSocket=16
>>>    CPUAlloc=0 CPUTot=256 CPULoad=8.07
>>>    AvailableFeatures=ht,gpu
>>>    ActiveFeatures=ht,gpu
>>>    Gres=gpu:A100:8
>>>    NodeAddr=nodeGPU01 NodeHostName=nodeGPU01 Version=20.11.2
>>>    OS=Linux 5.4.0-66-generic #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC 2021
>>>    RealMemory=1024000 AllocMem=0 FreeMem=1019774 Sockets=8 Boards=1
>>>    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>>>    Partitions=gpu,cpu
>>>    BootTime=2021-04-09T21:23:14 SlurmdStartTime=2021-04-11T10:11:12
>>>    CfgTRES=cpu=256,mem=1000G,billing=256
>>>    AllocTRES=
>>>    CapWatts=n/a
>>>    CurrentWatts=0 AveWatts=0
>>>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>>    Comment=(null)
>>>
>>>
>>>
>>>
>>> On Sun, Apr 11, 2021 at 2:03 AM Sean Crosby <scrosby at unimelb.edu.au>
>>> wrote:
>>>
>>>> Hi Cristobal,
>>>>
>>>> My hunch is it is due to the default memory/CPU settings.
>>>>
>>>> Does it work if you do
>>>>
>>>> srun --gres=gpu:A100:1 --cpus-per-task=1 --mem=10G nvidia-smi
>>>>
>>>> Sean
>>>> --
>>>> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
>>>> Research Computing Services | Business Services
>>>> The University of Melbourne, Victoria 3010 Australia
>>>>
>>>>
>>>>
>>>> On Sun, 11 Apr 2021 at 15:26, Cristóbal Navarro <
>>>> cristobal.navarro.g at gmail.com> wrote:
>>>>
>>>>> Hi Community,
>>>>> These last two days I've been trying to understand what is the cause
>>>>> of the "Unable to allocate resources" error I keep getting when specifying
>>>>> --gres=...  in a srun command (or sbatch). It fails with the error
>>>>> ➜  srun --gres=gpu:A100:1 nvidia-smi
>>>>> srun: error: Unable to allocate resources: Requested node
>>>>> configuration is not available
>>>>>
>>>>> log file on the master node (not the compute one)
>>>>> ➜  tail -f /var/log/slurm/slurmctld.log
>>>>> [2021-04-11T01:12:23.270] gres:gpu(7696487) type:(null)(0) job:1317
>>>>> flags: state
>>>>> [2021-04-11T01:12:23.270]   gres_per_node:1 node_cnt:0
>>>>> [2021-04-11T01:12:23.270]   ntasks_per_gres:65534
>>>>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no
>>>>> job_resources info for JobId=1317 rc=-1
>>>>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no
>>>>> job_resources info for JobId=1317 rc=-1
>>>>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no
>>>>> job_resources info for JobId=1317 rc=-1
>>>>> [2021-04-11T01:12:23.271] _pick_best_nodes: JobId=1317 never runnable
>>>>> in partition gpu
>>>>> [2021-04-11T01:12:23.271] _slurm_rpc_allocate_resources: Requested
>>>>> node configuration is not available
>>>>>
>>>>> If launched without --gres, it allocates all GPUs by default and
>>>>> nvidia-smi does work; in fact, our CUDA programs do work via SLURM when
>>>>> --gres is not specified.
>>>>> ➜  TUT04-GPU-multi git:(master) ✗ srun nvidia-smi
>>>>> Sun Apr 11 01:05:47 2021
>>>>>
>>>>> +-----------------------------------------------------------------------------+
>>>>> | NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
>>>>> |-------------------------------+----------------------+----------------------+
>>>>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>>>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>>>>> |                               |                      |               MIG M. |
>>>>> |===============================+======================+======================|
>>>>> |   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
>>>>> | N/A   31C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
>>>>> |                               |                      |             Disabled |
>>>>> ....
>>>>> ....
>>>>>
>>>>> There is only one DGX A100 compute node, with 8 GPUs and 2x 64-core
>>>>> CPUs, and the gres.conf file is simply the following (I also tried the
>>>>> commented lines):
>>>>> ➜  ~ cat /etc/slurm/gres.conf
>>>>> # GRES configuration for native GPUS
>>>>> # DGX A100 8x Nvidia A100
>>>>> #AutoDetect=nvml
>>>>> Name=gpu Type=A100 File=/dev/nvidia[0-7]
>>>>>
>>>>> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
>>>>> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
>>>>> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
>>>>> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
>>>>> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
>>>>> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
>>>>> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
>>>>> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
>>>>>
>>>>>
>>>>> Some relevant parts of the slurm.conf file
>>>>> ➜  cat /etc/slurm/slurm.conf
>>>>> ...
>>>>> ## GRES
>>>>> GresTypes=gpu
>>>>> AccountingStorageTRES=gres/gpu
>>>>> DebugFlags=CPU_Bind,gres
>>>>> ...
>>>>> ## Nodes list
>>>>> ## Default CPU layout, native GPUs
>>>>> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16
>>>>> ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:A100:8
>>>>> Feature=ht,gpu
>>>>> ...
>>>>> ## Partitions list
>>>>> PartitionName=gpu OverSubscribe=FORCE MaxCPUsPerNode=128
>>>>> MaxTime=INFINITE State=UP Nodes=nodeGPU01  Default=YES
>>>>> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128
>>>>> MaxTime=INFINITE State=UP Nodes=nodeGPU01
>>>>>
>>>>> Any ideas on where I should check?
>>>>> thanks in advance
>>>>> --
>>>>> Cristóbal A. Navarro
>>>>>
>>>>
>>>
>>> --
>>> Cristóbal A. Navarro
>>>
>>
>
> --
> Cristóbal A. Navarro
>


-- 
Cristóbal A. Navarro

