[slurm-users] [External] Re: Slurm queue seems to be completely blocked

Joakim Hove joakim.hove at gmail.com
Mon May 11 18:37:53 UTC 2020


> You’re on the right track with the DRAIN state. The more specific answer
> is in the “Reason=” description on the last line.
>
> It looks like your node has less memory than what you’ve defined for the
> node in slurm.conf
>

Thank you; that sounded meaningful to me. My slurm.conf file had
RealMemory=983 whereas "slurmd -C" showed "RealMemory=978" - so you are
right: the actual node had less available memory than what I had configured
in slurm.conf. I guess the difference comes from slightly different AWS
instance types? Anyway, I updated slurm.conf with "RealMemory=512" - i.e.
with a wide margin below what the node actually has. After restarting
slurmctld / slurmd I now get:

ubuntu at ip-172-31-80-232:~/opm-portal/aws$ scontrol show node
NodeName=ip-172-31-80-232 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=ip-172-31-80-232 NodeHostName=ip-172-31-80-232 Version=17.11
   OS=Linux 5.3.0-1017-aws #18~18.04.1-Ubuntu SMP Wed Apr 8 15:12:16 UTC 2020
   RealMemory=512 AllocMem=0 FreeMem=254 Sockets=1 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2020-05-11T17:02:15 SlurmdStartTime=2020-05-11T18:29:30
   CfgTRES=cpu=1,mem=512M,billing=1
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [root at 2020-05-11T16:20:02]

I.e. Slurm has recognized the new memory setting (RealMemory=512 in the
output above), but the state is still "IDLE+DRAIN" and the "Low RealMemory"
reason set before the restart is still attached - and no jobs start
running :-(