<div dir="ltr">We are on a Bright Cluster and their support says the head node controls this. Here you can see the sym links:<div><br>[root@node001 ~]# file /etc/slurm/slurm.conf<br>/etc/slurm/slurm.conf: symbolic link to `/cm/shared/apps/slurm/var/etc/slurm.conf'<br><br>[root@ourcluster myuser]# file /etc/slurm/slurm.conf<br>/etc/slurm/slurm.conf: symbolic link to `/cm/shared/apps/slurm/var/etc/slurm.conf'<br></div><div><br></div><div> ls -l /etc/slurm/slurm.conf<br>lrwxrwxrwx 1 root root 40 Nov 30 2018 /etc/slurm/slurm.conf -> /cm/shared/apps/slurm/var/etc/slurm.conf<br>[root@ourcluster myuser]# ssh node001<br>Last login: Mon Jan 20 14:02:00 2020<br>[root@node001 ~]# ls -l /etc/slurm/slurm.conf<br>lrwxrwxrwx 1 root root 40 Nov 30 2018 /etc/slurm/slurm.conf -> /cm/shared/apps/slurm/var/etc/slurm.conf<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jan 20, 2020 at 1:52 PM Brian Andrus <<a href="mailto:toomuchit@gmail.com">toomuchit@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Try using "nodename=node003" in the slurm.conf on your nodes.</p>
<p>Also, make sure the slurm.conf on the nodes is the same as on the
head.<br>
</p>
<p>Somewhere in there, you have "node=node003" (as well as the other node names).<br>
</p>
<p>That alone may fix it: the nodes may be trying to register
generically, so their configs aren't being matched against the
specific entries in your main config.</p>
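<p>A quick way to verify the files match (a sketch, using pdsh as elsewhere in this thread):</p>
<p><font face="monospace"># checksum the head node's copy, then every compute node's copy<br>md5sum /etc/slurm/slurm.conf<br>pdsh -w node00[1-3] md5sum /etc/slurm/slurm.conf</font></p>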
<p>Brian Andrus<br>
</p>
<p><br>
</p>
<div>On 1/20/2020 10:37 AM, Robert Kudyba
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">I've posted about this previously <a href="https://urldefense.proofpoint.com/v2/url?u=https-3A__groups.google.com_forum_-23-21searchin_slurm-2Dusers_kudyba-257Csort-3Adate_slurm-2Dusers_mMECjerUmFE_V1wK19fFAQAJ&d=DwMDaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=536v1kqVHYCPVjdMowh4_kfCXSihJp1LwoDKM8FWu08&s=V4tz7Qab3oK28vrC090A6R6aFEaDXz7Czqr5y2eDUk0&e=" target="_blank">here</a>, and <a href="https://urldefense.proofpoint.com/v2/url?u=https-3A__groups.google.com_forum_-23-21searchin_slurm-2Dusers_kudyba-257Csort-3Adate_slurm-2Dusers_vVAyqm0wg3Y_2YoBq744AAAJ&d=DwMDaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=536v1kqVHYCPVjdMowh4_kfCXSihJp1LwoDKM8FWu08&s=eEetgW964TvhYChxX27f_Bjz3tn5UlwUpVEVAZIdIKo&e=" target="_blank">here</a> so I'm trying to get to the
bottom of this once and for all and even got <a href="https://urldefense.proofpoint.com/v2/url?u=https-3A__groups.google.com_d_msg_slurm-2Dusers_vVAyqm0wg3Y_x9-2D-5FiQQaBwAJ&d=DwMDaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=536v1kqVHYCPVjdMowh4_kfCXSihJp1LwoDKM8FWu08&s=5UB2Ohj42gVpQ0GXneP02dO3kpRATj5OvQ4nmNTWZd4&e=" target="_blank">this comment</a> previously:
<div><br>
</div>
<div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">our problem here is that
the configuration for the nodes in question have an
incorrect amount of memory set for them. Looks like you have
it set in bytes instead of megabytes<br>
In your slurm.conf you should look at the RealMemory
setting:<br>
RealMemory<br>
Size of real memory on the node in megabytes (e.g. "2048").
The default value is 1.<br>
I would suggest RealMemory=191879 , where I suspect you have
RealMemory=196489092</blockquote>
<br>
</div>
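<div>One way to see exactly what slurmd detects on each node (and thus what slurmctld compares against at registration) is "slurmd -C"; a sketch, with illustrative numbers:</div>
<div><font face="monospace">pdsh -w node00[1-3] "slurmd -C | head -1"<br># each node prints its own definition line, e.g.:<br># NodeName=node001 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191846</font><br></div>
<div><br></div>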
<div>Now the slurmctld logs show this:</div>
<div><br>
<font face="monospace">[2020-01-20T13:22:48.256] error: Node
node002 has low real_memory size (191840 < 196489092)<br>
[2020-01-20T13:22:48.256] error: Setting node node002 state
to DRAIN<br>
[2020-01-20T13:22:48.256] drain_nodes: node node002 state
set to DRAIN<br>
[2020-01-20T13:22:48.256] error:
_slurm_rpc_node_registration node=node002: Invalid argument<br>
[2020-01-20T13:22:48.256] error: Node node001 has low
real_memory size (191846 < 196489092)<br>
[2020-01-20T13:22:48.256] error: Setting node node001 state
to DRAIN<br>
[2020-01-20T13:22:48.256] drain_nodes: node node001 state
set to DRAIN<br>
[2020-01-20T13:22:48.256] error:
_slurm_rpc_node_registration node=node001: Invalid argument<br>
[2020-01-20T13:22:48.256] error: Node node003 has low
real_memory size (191840 < 196489092)<br>
[2020-01-20T13:22:48.256] error: Setting node node003 state
to DRAIN<br>
[2020-01-20T13:22:48.256] drain_nodes: node node003 state
set to DRAIN<br>
[2020-01-20T13:22:48.256] error:
_slurm_rpc_node_registration node=node003: Invalid argument</font><br>
</div>
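<div>The two numbers in the error are the giveaway: the configured 196489092 looks like a kilobyte figure (probably MemTotal from /proc/meminfo, which is reported in kB), while slurmd reports megabytes. A quick check:</div>
<div><font face="monospace">echo $((196489092 / 1024))   # => 191883, right in line with the 191840-191846 MB the nodes report</font><br></div>
<div><br></div>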
<div><br>
</div>
<div>Here's the setting in slurm.conf:</div>
<div>/etc/slurm/slurm.conf<br>
# Nodes<br>
NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092
Sockets=2 Gres=gpu:1<br>
# Partitions<br>
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO
RootOnly=NO Hidden=NO Shared=NO GraceTime=0 Preempt$<br>
PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO
RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptM$<br>
</div>
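<div>If that reading is right, the fix is presumably just the RealMemory value in megabytes, kept at or below the smallest value the nodes actually report (the comment above suggested 191879; 191840 is the lowest figure in the log, since slurmctld drains any node that reports less memory than configured). A sketch:</div>
<div><font face="monospace"># Nodes<br>NodeName=node[001-003] CoresPerSocket=12 RealMemory=191840 Sockets=2 Gres=gpu:1</font><br></div>
<div><br></div>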
<div><br>
</div>
<div>sinfo -N<br>
NODELIST NODES PARTITION STATE<br>
node001 1 defq* drain<br>
node002 1 defq* drain<br>
node003 1 defq* drain<br>
</div>
<div><br>
</div>
<div>
pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"<br>
node001: Thread(s) per core: 1<br>
node001: Core(s) per socket: 12<br>
node001: Socket(s): 2<br>
node002: Thread(s) per core: 1<br>
node002: Core(s) per socket: 12<br>
node002: Socket(s): 2<br>
node003: Thread(s) per core: 2<br>
node003: Core(s) per socket: 12<br>
node003: Socket(s): 2<br>
<br>
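Note that lscpu reports 2 threads per core on node003 but 1 on the others; if hyper-threading really is enabled only on node003, a single NodeName line can't describe all three nodes. A sketch of splitting it (assuming the difference is intentional):<br><font face="monospace">NodeName=node[001-002] CoresPerSocket=12 Sockets=2 ThreadsPerCore=1 RealMemory=191840 Gres=gpu:1<br>NodeName=node003 CoresPerSocket=12 Sockets=2 ThreadsPerCore=2 RealMemory=191840 Gres=gpu:1</font><br><br>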
module load cmsh<br>
[root@ciscluster kudyba]# cmsh<br>
[ciscluster]% jobqueue<br>
[ciscluster->jobqueue(slurm)]% ls<br>
Type Name Nodes<br>
------------ ------------------------
----------------------------------------------------<br>
Slurm defq node001..node003<br>
Slurm gpuq<br>
<br>
use defq<br>
[ciscluster->jobqueue(slurm)->defq]% get options<br>
QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12
OverTimeLimit=0 State=UP<br>
<br>
scontrol show nodes node001<br>
NodeName=node001 Arch=x86_64 CoresPerSocket=12<br>
CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.07<br>
AvailableFeatures=(null)<br>
ActiveFeatures=(null)<br>
Gres=gpu:1<br>
NodeAddr=node001 NodeHostName=node001 Version=17.11<br>
OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9
18:05:47 UTC 2018<br>
RealMemory=196489092 AllocMem=0 FreeMem=98557 Sockets=2
Boards=1<br>
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1
Owner=N/A MCS_label=N/A<br>
Partitions=defq<br>
BootTime=2019-07-18T12:08:42
SlurmdStartTime=2020-01-17T21:34:15<br>
CfgTRES=cpu=24,mem=196489092M,billing=24<br>
AllocTRES=<br>
CapWatts=n/a<br>
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0<br>
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br>
Reason=Low RealMemory [slurm@2020-01-20T13:22:48]<br>
<br>
sinfo -R<br>
REASON USER TIMESTAMP NODELIST<br>
Low RealMemory slurm 2020-01-20T13:22:48
node[001-003]<br>
</div>
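<div>Once the config is fixed and the daemons restarted, the DRAIN flag still has to be cleared by hand; a sketch:</div>
<div><font face="monospace">scontrol update NodeName=node[001-003] State=RESUME</font><br></div>
<div><br></div>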
<div><br>
</div>
<div>And the total memory in each node:</div>
<div>ssh node001<br>
Last login: Mon Jan 20 13:34:00 2020<br>
[root@node001 ~]# free -h<br>
<font face="monospace">total used free shared buff/cache available<br>
Mem: 187G 69G 96G 4.0G 21G 112G<br>
Swap: 11G 11G 55M</font><br>
</div>
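<div>(free -h's 187G is the same figure rounded down: 191840 MB / 1024 ≈ 187.3 G. In megabytes, for comparison with the slurmctld log:)</div>
<div><font face="monospace">pdsh -w node00[1-3] "free -m | awk '/^Mem:/{print \$2}'"   # expect ~191840-191846</font><br></div>
<div><br></div>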
<div><br>
</div>
<div>What setting is incorrect here?</div>
</div>
</blockquote>
</div>
</blockquote></div>