[slurm-users] Gres GPU Resource Issue

Alex Chekholko alex at calicolabs.com
Sun May 17 16:05:42 UTC 2020


Hi Andrew,

I think something may be wrong with your slurmd; perhaps something is
missing from your install?

On the node (where slurmd is running), you should see a message similar to
this in slurmd.log:

[2020-05-11T14:29:17.766] Gres Name=gpu Type=titanrtx Count=4 ID=7696487 File=/dev/nvidia[0-3] (null)
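
If you want to check several nodes quickly, a rough sketch like the one below
could pull the Gres entries out of the log text. It assumes the log line has
the same layout as my example above (other Slurm versions may format it
differently), and find_gres_entries is just a hypothetical helper name, not
anything shipped with Slurm.

```python
import re

# Matches the "Gres Name=... Type=... Count=..." part of a slurmd.log line.
# The layout is assumed from the sample line in this message.
GRES_LINE = re.compile(
    r"Gres Name=(?P<name>\S+) Type=(?P<type>\S+) Count=(?P<count>\d+)"
)


def find_gres_entries(log_text):
    """Return a (name, type, count) tuple for each Gres line in log_text."""
    return [
        (m.group("name"), m.group("type"), int(m.group("count")))
        for m in GRES_LINE.finditer(log_text)
    ]


sample = (
    "[2020-05-11T14:29:17.766] Gres Name=gpu Type=titanrtx Count=4 "
    "ID=7696487 File=/dev/nvidia[0-3] (null)"
)
print(find_gres_entries(sample))
```

An empty list for a node's log would suggest that slurmd never registered any
GRES there, which is the first thing I would look at.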

Regards,
Alex

On Fri, May 15, 2020 at 2:52 PM Speer, Andrew <aspeer at siue.edu> wrote:

> I've run into a bit of an issue when trying to define GPUs in our Slurm
> configuration. Any insight is appreciated.
> Hopefully relevant lines from the configs are below.
>
> Error:
> [2020-05-15T16:35:14.862] error: gres_plugin_node_config_unpack: No plugin configured to process GRES data from node node3 (Name:gpu Type:(null) PluginID:7696487 Count:2)
> [2020-05-15T16:35:15.321] error: gres_plugin_node_config_unpack: No plugin configured to process GRES data from node node4 (Name:gpu Type:(null) PluginID:7696487 Count:1)
> [2020-05-15T16:35:15.738] error: gres_plugin_node_config_unpack: No plugin configured to process GRES data from node node5 (Name:gpu Type:(null) PluginID:7696487 Count:1)
> [2020-05-15T16:35:16.229] error: gres_plugin_node_config_unpack: No plugin configured to process GRES data from node node6 (Name:gpu Type:(null) PluginID:7696487 Count:1)
>
> /etc/slurm/slurm.conf:
> GresTypes=gpu
> NodeName=node[1-3]      CPUs=40 RealMemory=48000 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 Feature="pascal,p4000" Gres=gpu:8 State=UNKNOWN
> NodeName=node[4-5,7-10] CPUs=8  RealMemory=48000 Sockets=2 CoresPerSocket=4  ThreadsPerCore=1 Feature="pascal,p1000" Gres=gpu:8 State=UNKNOWN
> NodeName=node[6]        CPUs=24 RealMemory=30000 Sockets=2 CoresPerSocket=6  ThreadsPerCore=2 Feature="pascal,p1000" Gres=gpu:8 State=UNKNOWN
>
> /etc/slurm/gres.conf
> NodeName=node[1-3]  Name=gpu File=/dev/nvidia[0-7]
> NodeName=node[4-10] Name=gpu File=/dev/nvidia[0-4]
>
> scontrol show node node1
> NodeName=node1 Arch=x86_64 CoresPerSocket=10
>    CPUAlloc=0 CPUTot=40 CPULoad=1.75
>    AvailableFeatures=pascal,p4000
>    ActiveFeatures=pascal,p4000
>    Gres=(null)                               <------------------------
>    NodeAddr=node1 NodeHostName=node1
>    OS=Linux 3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019
>    RealMemory=48000 AllocMem=0 FreeMem=57465 Sockets=2 Boards=1
>    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=pharmacy
>    BootTime=2020-05-15T09:26:45 SlurmdStartTime=2020-05-15T16:35:13
>    CfgTRES=cpu=40,mem=48000M,billing=40
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 AveWatts=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
>
>
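
A side note on the quoted configs: slurm.conf lists Gres=gpu:8 on every
NodeName line, while the gres.conf File= ranges name 8 devices for node[1-3]
but only 5 for node[4-10], and the logged errors report Count:2 and Count:1.
I can't say whether that is related to the error, but it may be worth
reconciling. The sketch below simply counts the device files implied by a
File= value. count_devices is a hypothetical helper, and it only handles a
single [low-high] range, not the full Slurm range syntax.

```python
import re


def count_devices(file_spec):
    """Count device files named by a spec such as /dev/nvidia[0-7]."""
    m = re.search(r"\[(\d+)-(\d+)\]", file_spec)
    if m is None:
        return 1  # a single path with no bracketed range
    low, high = int(m.group(1)), int(m.group(2))
    return high - low + 1


# Values copied from the quoted gres.conf and slurm.conf.
gres_conf = {
    "node[1-3]": "/dev/nvidia[0-7]",
    "node[4-10]": "/dev/nvidia[0-4]",
}
configured_gpus = 8  # Gres=gpu:8 on every NodeName line

for nodes, spec in gres_conf.items():
    found = count_devices(spec)
    status = "matches" if found == configured_gpus else "differs from"
    print(f"{nodes}: {found} device files, {status} Gres=gpu:{configured_gpus}")
```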