[slurm-users] "Low socket*core*thre" - solution?

John Kelly john.kelly at broadcom.com
Wed May 2 19:33:37 MDT 2018


Hi Caleb

I noticed the same thing. If you configure a host with more memory than
it really has, Slurm will think something is wrong with the host and put
it in drain status. At least that is my theory; the vendor can likely
give you a better, more detailed answer.
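
In other words, the safe pattern seems to be keeping the values in slurm.conf
at or below whatever slurmd -C prints on that node. A minimal sketch (the
RealMemory figure here is only an illustration of leaving headroom, not taken
from anyone's actual config):

NodeName=odin CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128000

After editing slurm.conf, restarting slurmctld and slurmd is the safest way to
make sure the new node definition is picked up.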


-jfk

On Wed, May 2, 2018 at 6:23 PM, Caleb Smith <caleb at calebsmith.net> wrote:

> Hi all,
>
> Out of curiosity, what causes that? It'd be good to know for the future --
> I ran into the same issue and just edited the memory down, and it works fine
> now, but I'd like to know what causes that error. I'm assuming low
> resources, i.e. memory or CPU or whatever. Mind clarifying?
>
> On Wed, May 2, 2018, 7:11 PM John Kelly <john.kelly at broadcom.com> wrote:
>
>> Hi Matt
>>
>> scontrol update nodename=odin state=resume
>> scontrol update nodename=odin state=idle
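>>
>> Once it is back, comparing what the node reports with what the controller
>> has configured is a quick way to spot the mismatch (a hypothetical check,
>> using standard commands):
>>
>> slurmd -C                  # resources the node actually detects
>> scontrol show node odin    # resources slurmctld has configured for it
>> sinfo -R                   # any remaining drain reasons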
>>
>> -jfk
>>
>>
>>
>> On Wed, May 2, 2018 at 5:28 PM, Matt Hohmeister <hohmeister at psy.fsu.edu>
>> wrote:
>>
>>> I have a two-node cluster: the server/compute node is a Dell PowerEdge
>>> R730; the compute node, a Dell PowerEdge R630. On both of these nodes,
>>> slurmd -C gives me the exact same line:
>>>
>>>
>>>
>>> [me at odin slurm]$ slurmd -C
>>>
>>> NodeName=odin CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10
>>> ThreadsPerCore=2 RealMemory=128655
>>>
>>>
>>>
>>> [me at thor slurm]$ slurmd -C
>>>
>>> NodeName=thor CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10
>>> ThreadsPerCore=2 RealMemory=128655
>>>
>>>
>>>
>>> So I edited my slurm.conf appropriately:
>>>
>>>
>>>
>>> NodeName=odin CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10
>>> ThreadsPerCore=2 RealMemory=128655
>>>
>>> NodeName=thor CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10
>>> ThreadsPerCore=2 RealMemory=128655
>>>
>>>
>>>
>>> …and it looks good, except for the drain on my server/compute node:
>>>
>>>
>>>
>>> [me at odin slurm]$ sinfo
>>>
>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>>
>>> debug*       up   infinite      1  drain odin
>>>
>>> debug*       up   infinite      1   idle thor
>>>
>>>
>>>
>>> …for the following reason:
>>>
>>>
>>>
>>> [me at odin slurm]$ sinfo -R
>>>
>>> REASON               USER      TIMESTAMP           NODELIST
>>>
>>> Low socket*core*thre slurm     2018-05-02T11:55:38 odin
>>>
>>>
>>>
>>> Any ideas?
>>>
>>>
>>>
>>> Thanks!
>>>
>>>
>>>
>>> Matt Hohmeister
>>>
>>> Systems and Network Administrator
>>>
>>> Department of Psychology
>>>
>>> Florida State University
>>>
>>> PO Box 3064301
>>>
>>> Tallahassee, FL 32306-4301
>>>
>>> Phone: +1 850 645 1902
>>>
>>> Fax: +1 850 644 7739
>>>
>>>
>>>
>>
>>