<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>After you restart slurmctld, run "scontrol reconfigure".</p>
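<p>(A minimal sketch of that sequence, assuming the daemons are managed
by systemd; the unit names may differ on a Bright-managed head node:)</p>
<p>systemctl restart slurmctld   # on the head node, picks up the edited slurm.conf<br>
scontrol reconfigure          # asks the running daemons to re-read their configuration</p>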
<p>Brian Andrus<br>
</p>
<div class="moz-cite-prefix">On 8/30/2019 6:57 AM, Robert Kudyba
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAFHi+KRb=JUqxFeYVENV5m_524JAMZ0-FF8k9n7KP43D8XOmZw@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">I had set RealMemory to a really high number as I
mis-interpreted the recommendation.
<div>NodeName=node[001-003] CoresPerSocket=12
RealMemory=196489092 Sockets=2 Gres=gpu:1</div>
<div><br>
</div>
<div>But now I set it to:</div>
<div>RealMemory=191000<br>
<div><br>
</div>
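<div>(As a sanity check, and assuming the nodes are reachable with pdsh
as used later in this thread: "slurmd -C" prints the hardware that
slurmd actually detects, including RealMemory in megabytes, in
slurm.conf format, so its output can be compared against the line
above:)</div>
<div>pdsh -w node00[1-3] "slurmd -C"   # each node prints a NodeName=... RealMemory=... line</div>
<div><br>
</div>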
<div>I restarted slurmctld. And according to the Bright
Cluster support team:</div>
<div>"Unless it has been overridden in the image, the nodes
will have a symlink directly to the slurm.conf on the head
node. This means that any changes made to the file on the
head node will automatically be available to the compute
nodes. All they would need in that case is to have slurmd
restarted"</div>
<div><br>
</div>
<div>But now I see these errors:</div>
<div><br>
</div>
<div>mcs: MCSParameters = (null). ondemand set.<br>
[2019-08-30T09:22:41.700] error: Node node001 appears to
have a different slurm.conf than the slurmctld. This could
cause issues with communication and functionality. Please
review both files and make sure they are the same. If this
is expected ignore, and set DebugFlags=NO_CONF_HASH in your
slurm.conf.<br>
[2019-08-30T09:22:41.700] error: Node node002 appears to
have a different slurm.conf than the slurmctld. This could
cause issues with communication and functionality. Please
review both files and make sure they are the same. If this
is expected ignore, and set DebugFlags=NO_CONF_HASH in your
slurm.conf.<br>
[2019-08-30T09:22:41.701] error: Node node003 appears to
have a different slurm.conf than the slurmctld. This could
cause issues with communication and functionality. Please
review both files and make sure they are the same. If this
is expected ignore, and set DebugFlags=NO_CONF_HASH in your
slurm.conf.<br>
[2019-08-30T09:23:16.347] update_node: node node001 state
set to IDLE<br>
[2019-08-30T09:23:16.347] got (nil)<br>
[2019-08-30T09:23:16.766]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2<br>
[2019-08-30T09:23:19.082] update_node: node node002 state
set to IDLE<br>
[2019-08-30T09:23:19.082] got (nil)<br>
[2019-08-30T09:23:20.929] update_node: node node003 state
set to IDLE<br>
[2019-08-30T09:23:20.929] got (nil)<br>
[2019-08-30T09:45:46.314] _slurm_rpc_submit_batch_job:
JobId=449 InitPrio=4294901759 usec=355<br>
[2019-08-30T09:45:46.430] sched: Allocate JobID=449
NodeList=node[001-003] #CPUs=30 Partition=defq<br>
[2019-08-30T09:45:46.670] prolog_running_decr: Configuration
for JobID=449 is complete<br>
[2019-08-30T09:45:46.772] _job_complete: JobID=449 State=0x1
NodeCnt=3 WEXITSTATUS 127<br>
[2019-08-30T09:45:46.772] _job_complete: JobID=449
State=0x8005 NodeCnt=3 done<br>
</div>
</div>
<div><br>
</div>
<div>Is this another option that needs to be set?</div>
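<div><br>
</div>
<div>(Before setting DebugFlags=NO_CONF_HASH, it may be worth checking
whether the nodes really do see the same file; a quick check, assuming
the symlink setup described above:)</div>
<div>md5sum /etc/slurm/slurm.conf                       # on the head node<br>
pdsh -w node00[1-3] "md5sum /etc/slurm/slurm.conf"   # should report the same hash on every node; if so, the warning can be silenced with NO_CONF_HASH</div>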
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Aug 29, 2019 at 3:27
PM Alex Chekholko <<a href="mailto:alex@calicolabs.com"
target="_blank" moz-do-not-send="true">alex@calicolabs.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">Sounds like maybe you didn't correctly roll out
/ update your slurm.conf everywhere as your RealMemory value
is back to your large wrong number. You need to update your
slurm.conf everywhere and restart all the slurm daemons.
<div><br>
</div>
<div>I recommend the "safe procedure" from here: <a
href="https://urldefense.proofpoint.com/v2/url?u=https-3A__wiki.fysik.dtu.dk_niflheim_SLURM-23add-2Dand-2Dremove-2Dnodes&d=DwMFaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=yUZtCS8lFs9N4Dm1nidebq1bpGa9QMJUap7ZWVR8NVg&s=Fq72zWoETitTA7ayJCyYkbp8E1fInntp4YeBv75o7vU&e="
target="_blank" moz-do-not-send="true">https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes</a></div>
<div>Your Bright manual may have a similar process for
updating SLURM config "the Bright way".</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Aug 29, 2019 at
12:20 PM Robert Kudyba <<a
href="mailto:rkudyba@fordham.edu" target="_blank"
moz-do-not-send="true">rkudyba@fordham.edu</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">I thought I had taken
care of this a while back but it appears the issue has
returned. A very simply sbatch slurmhello.sh:<br>
cat slurmhello.sh<br>
#!/bin/sh<br>
#SBATCH -o my.stdout<br>
#SBATCH -N 3<br>
#SBATCH --ntasks=16<br>
module add shared openmpi/gcc/64/1.10.7 slurm<br>
mpirun hello<br>
<br>
sbatch slurmhello.sh<br>
Submitted batch job 419<br>
<br>
squeue<br>
JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)<br>
  419      defq slurmhel root PD  0:00     3 (Resources)<br>
<br>
In /etc/slurm/slurm.conf:<br>
# Nodes<br>
NodeName=node[001-003] CoresPerSocket=12
RealMemory=196489092 Sockets=2 Gres=gpu:1<br>
<br>
Logs show:<br>
[2019-08-29T14:24:40.025] error:
_slurm_rpc_node_registration node=node001: Invalid
argument<br>
[2019-08-29T14:24:40.025] error: Node node002 has low
real_memory size (191840 < 196489092)<br>
[2019-08-29T14:24:40.025] error:
_slurm_rpc_node_registration node=node002: Invalid
argument<br>
[2019-08-29T14:24:40.026] error: Node node003 has low
real_memory size (191840 < 196489092)<br>
[2019-08-29T14:24:40.026] error:
_slurm_rpc_node_registration node=node003: Invalid
argument<br>
<br>
scontrol show jobid -dd 419<br>
JobId=419 JobName=slurmhello.sh<br>
UserId=root(0) GroupId=root(0) MCS_label=N/A<br>
Priority=4294901759 Nice=0 Account=root QOS=normal<br>
JobState=PENDING Reason=Resources Dependency=(null)<br>
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0<br>
DerivedExitCode=0:0<br>
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A<br>
SubmitTime=2019-08-28T09:54:22
EligibleTime=2019-08-28T09:54:22<br>
StartTime=Unknown EndTime=Unknown Deadline=N/A<br>
PreemptTime=None SuspendTime=None SecsPreSuspend=0<br>
LastSchedEval=2019-08-28T09:57:22<br>
Partition=defq AllocNode:Sid=ourcluster:194152<br>
ReqNodeList=(null) ExcNodeList=(null)<br>
NodeList=(null)<br>
NumNodes=3-3 NumCPUs=16 NumTasks=16 CPUs/Task=1
ReqB:S:C:T=0:0:*:*<br>
TRES=cpu=16,node=3<br>
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*<br>
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0<br>
Features=(null) DelayBoot=00:00:00<br>
Gres=(null) Reservation=(null)<br>
OverSubscribe=YES Contiguous=0 Licenses=(null)
Network=(null)<br>
Command=/root/slurmhello.sh<br>
WorkDir=/root<br>
StdErr=/root/my.stdout<br>
StdIn=/dev/null<br>
StdOut=/root/my.stdout<br>
Power=<br>
<br>
scontrol show nodes node001<br>
NodeName=node001 Arch=x86_64 CoresPerSocket=12<br>
CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.06<br>
AvailableFeatures=(null)<br>
ActiveFeatures=(null)<br>
Gres=gpu:1<br>
NodeAddr=node001 NodeHostName=node001 Version=17.11<br>
OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9
18:05:47 UTC 2018<br>
RealMemory=196489092 AllocMem=0 FreeMem=99923 Sockets=2
Boards=1<br>
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1
Owner=N/A MCS_label=N/A<br>
Partitions=defq<br>
BootTime=2019-07-18T12:08:41
SlurmdStartTime=2019-07-18T12:09:44<br>
CfgTRES=cpu=24,mem=196489092M,billing=24<br>
AllocTRES=<br>
CapWatts=n/a<br>
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0<br>
ExtSensorsJoules=n/s ExtSensorsWatts=0
ExtSensorsTemp=n/s<br>
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]<br>
<br>
[root@ciscluster ~]# scontrol show nodes| grep -i mem<br>
RealMemory=196489092 AllocMem=0 FreeMem=99923 Sockets=2
Boards=1<br>
CfgTRES=cpu=24,mem=196489092M,billing=24<br>
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]<br>
RealMemory=196489092 AllocMem=0 FreeMem=180969
Sockets=2 Boards=1<br>
CfgTRES=cpu=24,mem=196489092M,billing=24<br>
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]<br>
RealMemory=196489092 AllocMem=0 FreeMem=178999
Sockets=2 Boards=1<br>
CfgTRES=cpu=24,mem=196489092M,billing=24<br>
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]<br>
<br>
sinfo -R<br>
REASON          USER   TIMESTAMP            NODELIST<br>
Low RealMemory  slurm  2019-07-18T10:17:24  node[001-003]<br>
<br>
sinfo -N<br>
NODELIST NODES PARTITION STATE<br>
node001 1 defq* drain<br>
node002 1 defq* drain<br>
node003 1 defq* drain<br>
<br>
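(A side note on a likely follow-up step: once RealMemory is corrected
in slurm.conf and slurmd restarted, nodes drained with "Low RealMemory"
usually still need an explicit resume, for example:)<br>
scontrol update NodeName=node[001-003] State=RESUME<br>
<br>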
pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"<br>
node002: Thread(s) per core: 1<br>
node002: Core(s) per socket: 12<br>
node002: Socket(s): 2<br>
node001: Thread(s) per core: 1<br>
node001: Core(s) per socket: 12<br>
node001: Socket(s): 2<br>
node003: Thread(s) per core: 2<br>
node003: Core(s) per socket: 12<br>
node003: Socket(s): 2<br>
<br>
scontrol show nodes| grep -i mem<br>
RealMemory=196489092 AllocMem=0 FreeMem=100054
Sockets=2 Boards=1<br>
CfgTRES=cpu=24,mem=196489092M,billing=24<br>
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]<br>
RealMemory=196489092 AllocMem=0 FreeMem=181101
Sockets=2 Boards=1<br>
CfgTRES=cpu=24,mem=196489092M,billing=24<br>
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]<br>
RealMemory=196489092 AllocMem=0 FreeMem=179004
Sockets=2 Boards=1<br>
CfgTRES=cpu=24,mem=196489092M,billing=24<br>
Reason=Low RealMemory<br>
<br>
Does anything look off?<br>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</body>
</html>