[slurm-users] One node is not used by slurm

Mahmood Naderan mahmood.nt at gmail.com
Sun Apr 19 18:10:08 UTC 2020


Hi,
Although compute-0-0 is included in a partition, I have noticed that
no job is ever scheduled there automatically. If someone explicitly
requests it with --nodelist=compute-0-0, it works fine.
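For reference, this is roughly what such an explicit submission looks like
(job.sh and the task count are just placeholders for the real job):

# sbatch --nodelist=compute-0-0 --ntasks=4 job.sh
# srun -p SEA --nodelist=compute-0-0 hostname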

# grep -r compute-0-0 .
./nodenames.conf.new:NodeName=compute-0-0 NodeAddr=10.1.1.254 CPUs=32 Weight=20511900 Feature=rack-0,32CPUs
./node.conf:NodeName=compute-0-0 NodeAddr=10.1.1.254 CPUs=32 Weight=20511900 Feature=rack-0,32CPUs
./nodenames.conf.new4:NodeName=compute-0-0 NodeAddr=10.1.1.254 CPUs=32 Weight=20511900 Feature=rack-0,32CPUs
# grep -r compute-0-1 .
./nodenames.conf.new:NodeName=compute-0-1 NodeAddr=10.1.1.253 CPUs=32 Weight=20511899 Feature=rack-0,32CPUs
./node.conf:NodeName=compute-0-1 NodeAddr=10.1.1.253 CPUs=32 Weight=20511899 Feature=rack-0,32CPUs
./nodenames.conf.new4:NodeName=compute-0-1 NodeAddr=10.1.1.253 CPUs=32 Weight=20511899 Feature=rack-0,32CPUs
# cat parts
PartitionName=WHEEL RootOnly=yes Priority=1000 Nodes=ALL
PartitionName=SEA AllowAccounts=fish Nodes=ALL
# scontrol show node compute-0-0
NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=32 CPULoad=0.01
   AvailableFeatures=rack-0,32CPUs
   ActiveFeatures=rack-0,32CPUs
   Gres=(null)
   NodeAddr=10.1.1.254 NodeHostName=compute-0-0
   OS=Linux 3.10.0-1062.1.2.el7.x86_64 #1 SMP Mon Sep 30 14:19:46 UTC 2019
   RealMemory=64259 AllocMem=0 FreeMem=63421 Sockets=32 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=444124 Weight=20511900 Owner=N/A MCS_label=N/A
   Partitions=CLUSTER,WHEEL,SEA
   BootTime=2020-04-18T10:30:07 SlurmdStartTime=2020-04-19T22:32:12
   CfgTRES=cpu=32,mem=64259M,billing=47
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               436       SEA  relax13      raz  R   21:44:22      3 compute-0-[1-2],hpc
               435       SEA 261660mo      abb  R 1-05:19:31      3 compute-0-[1-2],hpc

compute-0-0 is idle, so why did Slurm decide to put those jobs on the other nodes?
Any ideas for debugging this?
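In case it helps, these are the checks I was planning to run (assuming
sbatch's --test-only option and sinfo's %w weight field report what the
scheduler would actually do; job.sh again stands in for the real script):

# sbatch --test-only job.sh            (prints where the job would start, without submitting it)
# sinfo -N -o "%N %t %w %f"            (node state, scheduling weight and features side by side)
# scontrol show partition SEA          (the nodes the partition actually contains)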


Regards,
Mahmood
