<div dir="ltr">Sounds like maybe you didn't correctly roll out / update your slurm.conf everywhere as your RealMemory value is back to your large wrong number. You need to update your slurm.conf everywhere and restart all the slurm daemons.<div><br></div><div>I recommend the "safe procedure" from here: <a href="https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes">https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes</a></div><div>Your Bright manual may have a similar process for updating SLURM config "the Bright way".</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Aug 29, 2019 at 12:20 PM Robert Kudyba <<a href="mailto:rkudyba@fordham.edu">rkudyba@fordham.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I thought I had taken care of this a while back but it appears the issue has returned. A very simply sbatch slurmhello.sh:<br>
> cat slurmhello.sh
> #!/bin/sh
> #SBATCH -o my.stdout
> #SBATCH -N 3
> #SBATCH --ntasks=16
> module add shared openmpi/gcc/64/1.10.7 slurm
> mpirun hello
>
> sbatch slurmhello.sh
> Submitted batch job 419
>
> squeue
> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
> 419 defq slurmhel root PD 0:00 3 (Resources)
>
> In /etc/slurm/slurm.conf:
> # Nodes
> NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092 Sockets=2 Gres=gpu:1
>
> Logs show:
> [2019-08-29T14:24:40.025] error: _slurm_rpc_node_registration node=node001: Invalid argument
> [2019-08-29T14:24:40.025] error: Node node002 has low real_memory size (191840 < 196489092)
> [2019-08-29T14:24:40.025] error: _slurm_rpc_node_registration node=node002: Invalid argument
> [2019-08-29T14:24:40.026] error: Node node003 has low real_memory size (191840 < 196489092)
> [2019-08-29T14:24:40.026] error: _slurm_rpc_node_registration node=node003: Invalid argument
>
> scontrol show jobid -dd 419
> JobId=419 JobName=slurmhello.sh
> UserId=root(0) GroupId=root(0) MCS_label=N/A
> Priority=4294901759 Nice=0 Account=root QOS=normal
> JobState=PENDING Reason=Resources Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> DerivedExitCode=0:0
> RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
> SubmitTime=2019-08-28T09:54:22 EligibleTime=2019-08-28T09:54:22
> StartTime=Unknown EndTime=Unknown Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> LastSchedEval=2019-08-28T09:57:22
> Partition=defq AllocNode:Sid=ourcluster:194152
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=(null)
> NumNodes=3-3 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=16,node=3
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> Gres=(null) Reservation=(null)
> OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
> Command=/root/slurmhello.sh
> WorkDir=/root
> StdErr=/root/my.stdout
> StdIn=/dev/null
> StdOut=/root/my.stdout
> Power=
>
> scontrol show nodes node001
> NodeName=node001 Arch=x86_64 CoresPerSocket=12
> CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.06
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=gpu:1
> NodeAddr=node001 NodeHostName=node001 Version=17.11
> OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
> RealMemory=196489092 AllocMem=0 FreeMem=99923 Sockets=2 Boards=1
> State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> Partitions=defq
> BootTime=2019-07-18T12:08:41 SlurmdStartTime=2019-07-18T12:09:44
> CfgTRES=cpu=24,mem=196489092M,billing=24
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
>
> [root@ciscluster ~]# scontrol show nodes| grep -i mem
> RealMemory=196489092 AllocMem=0 FreeMem=99923 Sockets=2 Boards=1
> CfgTRES=cpu=24,mem=196489092M,billing=24
> Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
> RealMemory=196489092 AllocMem=0 FreeMem=180969 Sockets=2 Boards=1
> CfgTRES=cpu=24,mem=196489092M,billing=24
> Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
> RealMemory=196489092 AllocMem=0 FreeMem=178999 Sockets=2 Boards=1
> CfgTRES=cpu=24,mem=196489092M,billing=24
> Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
>
> sinfo -R
> REASON USER TIMESTAMP NODELIST
> Low RealMemory slurm 2019-07-18T10:17:24 node[001-003]
>
> sinfo -N
> NODELIST NODES PARTITION STATE
> node001 1 defq* drain
> node002 1 defq* drain
> node003 1 defq* drain
>
> pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
> node002: Thread(s) per core: 1
> node002: Core(s) per socket: 12
> node002: Socket(s): 2
> node001: Thread(s) per core: 1
> node001: Core(s) per socket: 12
> node001: Socket(s): 2
> node003: Thread(s) per core: 2
> node003: Core(s) per socket: 12
> node003: Socket(s): 2
>
> scontrol show nodes| grep -i mem
> RealMemory=196489092 AllocMem=0 FreeMem=100054 Sockets=2 Boards=1
> CfgTRES=cpu=24,mem=196489092M,billing=24
> Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
> RealMemory=196489092 AllocMem=0 FreeMem=181101 Sockets=2 Boards=1
> CfgTRES=cpu=24,mem=196489092M,billing=24
> Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
> RealMemory=196489092 AllocMem=0 FreeMem=179004 Sockets=2 Boards=1
> CfgTRES=cpu=24,mem=196489092M,billing=24
> Reason=Low RealMemory
>
> Does anything look off?
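As promised above, a rough sketch of what I mean, based only on what your own logs show (treat the numbers as something to verify on your nodes, not gospel): RealMemory in slurm.conf is given in megabytes, and slurmd on your nodes is reporting 191840 MB, so the node line should use that value or a bit less, e.g.:

    NodeName=node[001-003] CoresPerSocket=12 RealMemory=191840 Sockets=2 Gres=gpu:1

You can double-check what slurmd actually detects on each node with:

    slurmd -C

Once the same slurm.conf is on the head node and every compute node, restart slurmctld and then the slurmd on each node (the "safe procedure" in the link above covers the ordering). After that the nodes will still be drained with "Low RealMemory", so return them to service by hand:

    scontrol update NodeName=node[001-003] State=RESUME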