[slurm-users] [EXT] [Beginner, SLURM 20.11.2] Unable to allocate resources when specifying gres in srun or sbatch

Cristóbal Navarro cristobal.navarro.g at gmail.com
Thu May 20 14:45:58 UTC 2021


Hi Community,
Just wanted to share that this problem got solved with the help of the pyxis
developers:
https://github.com/NVIDIA/pyxis/issues/47

The solution was to add
ConstrainDevices=yes
to the cgroup.conf file, where it was missing.
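
For reference, a minimal cgroup.conf sketch with that line in place (the other
entries are assumptions for a typical task/cgroup setup, not copied verbatim
from our cluster, so adjust to your own configuration):

# /etc/slurm/cgroup.conf
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes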


On Thu, May 13, 2021 at 5:14 PM Cristóbal Navarro <
cristobal.navarro.g at gmail.com> wrote:

> Hi Sean and Community,
> A few days ago I switched to the cons_tres plugin and also got
> AutoDetect=nvml working in gres.conf (attached at the end of the email); the
> node and partition definitions seem to be OK (also attached at the end).
> I believe the Slurm setup is just a few steps away from being properly set
> up, but currently I have two very basic scenarios that are raising
> questions/problems:
>
> *1) Running GPU jobs without containers*:
> I was expecting that when running, for example, "srun -p gpu
> --gres=gpu:A100:1 nvidia-smi -L", the output would list just 1 GPU. However,
> that is not the case.
> ➜  TUT03-GPU-single srun -p gpu --gres=gpu:A100:1 nvidia-smi -L
> GPU 0: A100-SXM4-40GB (UUID: GPU-baa4736e-088f-77ce-0290-ba745327ca95)
> GPU 1: A100-SXM4-40GB (UUID: GPU-d40a3b1b-006b-37de-8b72-669c59d14954)
> GPU 2: A100-SXM4-40GB (UUID: GPU-35a012ac-2b34-b68f-d922-24aa07af1be6)
> GPU 3: A100-SXM4-40GB (UUID: GPU-b75a4bf8-123b-a8c0-dc75-7709626ead20)
> GPU 4: A100-SXM4-40GB (UUID: GPU-9366ff9f-a20a-004e-36eb-8376655b1419)
> GPU 5: A100-SXM4-40GB (UUID: GPU-75da7cd5-daf3-10fd-2c3f-56259c1dc777)
> GPU 6: A100-SXM4-40GB (UUID: GPU-f999e415-54e5-9d7f-0c4b-1d4d98a1dbfc)
> GPU 7: A100-SXM4-40GB (UUID: GPU-cce4a787-1b22-bed7-1e93-612906567a0e)
>
> Still, when opening an interactive session, it really does provide just 1 GPU.
> ➜  TUT03-GPU-single srun -p gpu --gres=gpu:A100:1 --pty bash
>
> user at nodeGPU01:$ echo $CUDA_VISIBLE_DEVICES
> 2
>
> Moreover, I tried running simultaneous jobs, each one with
> --gres=gpu:A100:1 and the source code logically choosing GPU ID 0, and
> indeed different physical GPUs get used, which is great (see the sketch
> below). My only concern for *1)* is the listing that always displays all of
> the devices. It could confuse users into thinking they have all those GPUs
> at their disposal, leading them to wrong decisions. Nevertheless, this issue
> is not critical compared to the next one.
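>
> To show what I mean by "logically choosing GPU ID 0", here is a minimal
> sketch of the pattern (my own illustration, not the actual code of ./prog):
> inside a job started with --gres=gpu:A100:1, the CUDA runtime renumbers the
> devices listed in CUDA_VISIBLE_DEVICES starting from 0, so logical device 0
> is the GPU that Slurm granted, whatever its physical index is.
>
> // check_gpu.cu -- compile with: nvcc check_gpu.cu -o check_gpu
> #include <cstdio>
> #include <cuda_runtime.h>
>
> int main() {
>     int count = 0;
>     cudaGetDeviceCount(&count);       // counts only the GPUs exposed to this job
>     printf("visible GPUs: %d\n", count);
>     cudaSetDevice(0);                 // logical GPU 0 == the GPU Slurm allocated
>     cudaDeviceProp prop;
>     cudaGetDeviceProperties(&prop, 0);
>     printf("GPU 0: %s\n", prop.name);
>     return 0;
> }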
>
> *2) Running GPU jobs with containers (pyxis + enroot)*
> In this case, the list of GPUs does get reduced to the number of devices
> selected with gres; however, there seems to be a problem with how GPU IDs
> referenced from inside the container map to the physical GPUs, which
> produces a CUDA runtime error.
>
> Running nvidia-smi gives:
> ➜  TUT03-GPU-single srun -p gpu --container-name=cuda-11.2.2
> --container-image=cuda-11.2.2 --pty --gres=gpu:A100:1 nvidia-smi -L
>
> GPU 0: A100-SXM4-40GB (UUID: GPU-35a012ac-2b34-b68f-d922-24aa07af1be6)
> As we can see, physical GPU 2 is allocated (we can check with the UUID).
> From what I understand of Slurm's model, the programmer does not need to
> know that this is GPU ID 2; they can just develop a program targeting GPU
> ID 0, because only 1 GPU is allocated. That is how it worked in case 1);
> otherwise one could not know which GPU ID is the available one.
>
> Now, if I launch a job with --gres=gpu:A100:1 running something like a CUDA
> matrix multiply that also prints some NVML info, I get:
> ➜  TUT03-GPU-single srun -p gpu --container-name=cuda-11.2.2
> --container-image=cuda-11.2.2 --pty --gres=gpu:A100:1 ./prog 0 $((1024*40))
> 1
> Driver version: 450.102.04
> NUM GPUS = 1
> Listing devices:
> GPU0 A100-SXM4-40GB, index=0,
> UUID=GPU-35a012ac-2b34-b68f-d922-24aa07af1be6  -> util = 0%
> Choosing GPU 0
> GPUassert: no CUDA-capable device is detected main.cu 112
> srun: error: nodeGPU01: task 0: Exited with exit code 100
>
> The "index=.." is the GPU index reported by NVML.
> Now, if I request --gres=gpu:A100:3, the real first GPU gets allocated and
> the program works, but that is not how Slurm should behave.
> ➜  TUT03-GPU-single srun -p gpu --container-name=cuda-11.2.2
> --container-image=cuda-11.2.2 --pty --gres=gpu:A100:3 ./prog 0 $((1024*40))
> 1
> Driver version: 450.102.04
> NUM GPUS = 3
> Listing devices:
> GPU0 A100-SXM4-40GB, index=0,
> UUID=GPU-baa4736e-088f-77ce-0290-ba745327ca95  -> util = 0%
> GPU1 A100-SXM4-40GB, index=1,
> UUID=GPU-35a012ac-2b34-b68f-d922-24aa07af1be6  -> util = 0%
> GPU2 A100-SXM4-40GB, index=2,
> UUID=GPU-b75a4bf8-123b-a8c0-dc75-7709626ead20  -> util = 0%
> Choosing GPU 0
> initializing A and B.......done
> matmul shared mem..........done: time: 26.546274 secs
> copying result to host.....done
> verifying result...........done
>
> I find it very strange that, when using containers, GPU 0 from inside the
> job seems to be trying to access the machine's real physical GPU 0, and not
> the GPU 0 provided by Slurm as in case 1), which worked well.
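>
> A quick check that might make the mismatch visible (just a diagnostic idea
> of mine, reusing the same container image as above) is to compare what
> Slurm exports with what the container actually enumerates:
>
> srun -p gpu --gres=gpu:A100:1 --container-name=cuda-11.2.2 \
>     --container-image=cuda-11.2.2 bash -c \
>     'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES; nvidia-smi -L'
>
> If CUDA_VISIBLE_DEVICES still holds the host-side index (e.g. 2) while
> nvidia-smi -L lists a single renumbered device, that would explain the
> "no CUDA-capable device is detected" error.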
>
> If anyone has advice on where to look for either of the two issues, I would
> really appreciate it.
> Many thanks in advance, and sorry for the long email.
> -- Cristobal
>
>
> ---------------------
> CONFIG FILES
> *# gres.conf*
> ➜  ~ cat /etc/slurm/gres.conf
> AutoDetect=nvml
>
>
>
> *# slurm.conf*
>
> *....*
> ## Basic scheduling
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
> SchedulerType=sched/backfill
>
> ## Accounting
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStoreJobComment=YES
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/linux
> AccountingStorageHost=10.10.0.1
>
> TaskPlugin=task/cgroup
> ProctrackType=proctrack/cgroup
>
> ## scripts
> Epilog=/etc/slurm/epilog
> Prolog=/etc/slurm/prolog
> PrologFlags=Alloc
>
> ## Nodes list
> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1
> RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8
> Feature=gpu
>
> ## Partitions list
> PartitionName=gpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=65556
> DefCpuPerGPU=8 DefMemPerGPU=65556 MaxMemPerNode=532000 MaxTime=1-00:00:00
> State=UP Nodes=nodeGPU01  Default=YES
> PartitionName=cpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=16384
> MaxMemPerNode=420000 MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01
>
> On Tue, Apr 13, 2021 at 9:38 PM Cristóbal Navarro <
> cristobal.navarro.g at gmail.com> wrote:
>
>> Hi Sean,
>> Sorry for the delay.
>> The problem got solved accidentally by restarting the Slurm services on
>> the head node.
>> Maybe it was an unfortunate combination of changes that I assumed
>> "scontrol reconfigure" would apply properly.
>>
>> Anyway, I will follow your advice and try changing to the "cons_tres"
>> plugin.
>> I will post back with the result.
>> Best and many thanks
>>
>> On Mon, Apr 12, 2021 at 6:35 AM Sean Crosby <scrosby at unimelb.edu.au>
>> wrote:
>>
>>> Hi Cristobal,
>>>
>>> The weird stuff I see in your job is
>>>
>>> [2021-04-11T01:12:23.270] gres:gpu(7696487) type:(null)(0) job:1317
>>> flags: state
>>> [2021-04-11T01:12:23.270]   gres_per_node:1 node_cnt:0
>>> [2021-04-11T01:12:23.270]   ntasks_per_gres:65534
>>>
>>> Not sure why ntasks_per_gres is 65534 and node_cnt is 0.
>>>
>>> Can you try
>>>
>>> srun --gres=gpu:A100:1 --mem=10G --cpus-per-gpu=1 --nodes=1 nvidia-smi
>>>
>>> and post the output of slurmctld.log?
>>>
>>> I also recommend changing from cons_res to cons_tres for SelectType
>>>
>>> e.g.
>>>
>>> SelectType=select/cons_tres
>>> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
>>>
>>> Sean
>>>
>>> --
>>> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
>>> Research Computing Services | Business Services
>>> The University of Melbourne, Victoria 3010 Australia
>>>
>>>
>>>
>>> On Mon, 12 Apr 2021 at 00:18, Cristóbal Navarro <
>>> cristobal.navarro.g at gmail.com> wrote:
>>>
>>>> Hi Sean,
>>>> I tried as suggested but am still getting the same error.
>>>> This is the node configuration visible to 'scontrol', just in case:
>>>> ➜  scontrol show node
>>>> NodeName=nodeGPU01 Arch=x86_64 CoresPerSocket=16
>>>>    CPUAlloc=0 CPUTot=256 CPULoad=8.07
>>>>    AvailableFeatures=ht,gpu
>>>>    ActiveFeatures=ht,gpu
>>>>    Gres=gpu:A100:8
>>>>    NodeAddr=nodeGPU01 NodeHostName=nodeGPU01 Version=20.11.2
>>>>    OS=Linux 5.4.0-66-generic #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC
>>>> 2021
>>>>    RealMemory=1024000 AllocMem=0 FreeMem=1019774 Sockets=8 Boards=1
>>>>    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
>>>> MCS_label=N/A
>>>>    Partitions=gpu,cpu
>>>>    BootTime=2021-04-09T21:23:14 SlurmdStartTime=2021-04-11T10:11:12
>>>>    CfgTRES=cpu=256,mem=1000G,billing=256
>>>>    AllocTRES=
>>>>    CapWatts=n/a
>>>>    CurrentWatts=0 AveWatts=0
>>>>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>>>    Comment=(null)
>>>>
>>>>
>>>>
>>>>
>>>> On Sun, Apr 11, 2021 at 2:03 AM Sean Crosby <scrosby at unimelb.edu.au>
>>>> wrote:
>>>>
>>>>> Hi Cristobal,
>>>>>
>>>>> My hunch is it is due to the default memory/CPU settings.
>>>>>
>>>>> Does it work if you do
>>>>>
>>>>> srun --gres=gpu:A100:1 --cpus-per-task=1 --mem=10G nvidia-smi
>>>>>
>>>>> Sean
>>>>> --
>>>>> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
>>>>> Research Computing Services | Business Services
>>>>> The University of Melbourne, Victoria 3010 Australia
>>>>>
>>>>>
>>>>>
>>>>> On Sun, 11 Apr 2021 at 15:26, Cristóbal Navarro <
>>>>> cristobal.navarro.g at gmail.com> wrote:
>>>>>
>>>>>> Hi Community,
>>>>>> For the last two days I've been trying to understand the cause of the
>>>>>> "Unable to allocate resources" error I keep getting when specifying
>>>>>> --gres=... in an srun command (or sbatch). It fails with the error:
>>>>>> ➜  srun --gres=gpu:A100:1 nvidia-smi
>>>>>> srun: error: Unable to allocate resources: Requested node
>>>>>> configuration is not available
>>>>>>
>>>>>> Log file on the master node (not the compute node):
>>>>>> ➜  tail -f /var/log/slurm/slurmctld.log
>>>>>> [2021-04-11T01:12:23.270] gres:gpu(7696487) type:(null)(0) job:1317
>>>>>> flags: state
>>>>>> [2021-04-11T01:12:23.270]   gres_per_node:1 node_cnt:0
>>>>>> [2021-04-11T01:12:23.270]   ntasks_per_gres:65534
>>>>>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no
>>>>>> job_resources info for JobId=1317 rc=-1
>>>>>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no
>>>>>> job_resources info for JobId=1317 rc=-1
>>>>>> [2021-04-11T01:12:23.270] select/cons_res: common_job_test: no
>>>>>> job_resources info for JobId=1317 rc=-1
>>>>>> [2021-04-11T01:12:23.271] _pick_best_nodes: JobId=1317 never runnable
>>>>>> in partition gpu
>>>>>> [2021-04-11T01:12:23.271] _slurm_rpc_allocate_resources: Requested
>>>>>> node configuration is not available
>>>>>>
>>>>>> If launched without --gres, it allocates all GPUs by default and
>>>>>> nvidia-smi does work; in fact, our CUDA programs do work via Slurm if
>>>>>> --gres is not specified.
>>>>>> ➜  TUT04-GPU-multi git:(master) ✗ srun nvidia-smi
>>>>>> Sun Apr 11 01:05:47 2021
>>>>>>
>>>>>> +-----------------------------------------------------------------------------+
>>>>>> | NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
>>>>>> |-------------------------------+----------------------+----------------------+
>>>>>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>>>>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>>>>>> |                               |                      |               MIG M. |
>>>>>> |===============================+======================+======================|
>>>>>> |   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
>>>>>> | N/A   31C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
>>>>>> |                               |                      |             Disabled |
>>>>>> ....
>>>>>> ....
>>>>>>
>>>>>> There is only one DGX A100 compute node, with 8 GPUs and 2x 64-core
>>>>>> CPUs, and the gres.conf file is simply (I also tried the commented lines):
>>>>>> ➜  ~ cat /etc/slurm/gres.conf
>>>>>> # GRES configuration for native GPUS
>>>>>> # DGX A100 8x Nvidia A100
>>>>>> #AutoDetect=nvml
>>>>>> Name=gpu Type=A100 File=/dev/nvidia[0-7]
>>>>>>
>>>>>> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
>>>>>> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
>>>>>> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
>>>>>> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
>>>>>> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
>>>>>> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
>>>>>> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
>>>>>> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
>>>>>>
>>>>>>
>>>>>> Some relevant parts of the slurm.conf file
>>>>>> ➜  cat /etc/slurm/slurm.conf
>>>>>> ...
>>>>>> ## GRES
>>>>>> GresTypes=gpu
>>>>>> AccountingStorageTRES=gres/gpu
>>>>>> DebugFlags=CPU_Bind,gres
>>>>>> ...
>>>>>> ## Nodes list
>>>>>> ## Default CPU layout, native GPUs
>>>>>> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16
>>>>>> ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:A100:8
>>>>>> Feature=ht,gpu
>>>>>> ...
>>>>>> ## Partitions list
>>>>>> PartitionName=gpu OverSubscribe=FORCE MaxCPUsPerNode=128
>>>>>> MaxTime=INFINITE State=UP Nodes=nodeGPU01  Default=YES
>>>>>> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128
>>>>>> MaxTime=INFINITE State=UP Nodes=nodeGPU01
>>>>>>
>>>>>> Any ideas where I should check?
>>>>>> Thanks in advance
>>>>>> --
>>>>>> Cristóbal A. Navarro
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Cristóbal A. Navarro
>>>>
>>>
>>
>> --
>> Cristóbal A. Navarro
>>
>
>
> --
> Cristóbal A. Navarro
>


-- 
Cristóbal A. Navarro