[slurm-users] One node is not used by slurm

Renfro, Michael Renfro at tntech.edu
Sun Apr 19 23:21:35 UTC 2020


Someone else might see more than I do, but from what you’ve posted, this looks like expected behavior: Slurm fills eligible nodes in ascending Weight order, so compute-0-0 will be used only after the lower-weighted nodes are too full to accept a particular job.
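
As a minimal sketch of that ordering (node names and weights here are invented, not from your config), a pair of definitions like this would fill node-a completely before anything lands on node-b, all else being equal:

  NodeName=node-a CPUs=32 Weight=10   # lowest weight: allocated first
  NodeName=node-b CPUs=32 Weight=20   # used only once node-a can't fit the job

Your compute-0-0 has Weight=20511900, one higher than compute-0-1’s 20511899, so it sits at the back of that ordering.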

I assume you’ve already submitted a set of jobs requesting enough resources to fill all the nodes, and that some jobs then stay pending instead of using compute-0-0 while it sits idle?
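
If not, a quick way to test is to saturate the partition and watch where the work lands. A rough sketch, assuming your account can submit to SEA and using a placeholder sleep job:

  # submit enough 32-CPU jobs to overflow the lower-weighted nodes
  for i in 1 2 3 4; do sbatch -p SEA --ntasks=32 --wrap="sleep 600"; done
  # then check per-node allocation, node state, and pending reasons
  sinfo -N -o "%N %C %t"
  squeue -o "%i %P %t %R"

If jobs go pending while compute-0-0 still shows idle, the slurmctld log should say why the scheduler skipped that node.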

> On Apr 19, 2020, at 1:10 PM, Mahmood Naderan <mahmood.nt at gmail.com> wrote:
> 
> Hi,
> Although compute-0-0 is included in a partition, I have noticed that
> no jobs are scheduled there automatically. If someone intentionally
> writes --nodelist=compute-0-0, it works fine.
> 
> # grep -r compute-0-0 .
> ./nodenames.conf.new:NodeName=compute-0-0 NodeAddr=10.1.1.254 CPUs=32 Weight=20511900 Feature=rack-0,32CPUs
> ./node.conf:NodeName=compute-0-0 NodeAddr=10.1.1.254 CPUs=32 Weight=20511900 Feature=rack-0,32CPUs
> ./nodenames.conf.new4:NodeName=compute-0-0 NodeAddr=10.1.1.254 CPUs=32 Weight=20511900 Feature=rack-0,32CPUs
> # grep -r compute-0-1 .
> ./nodenames.conf.new:NodeName=compute-0-1 NodeAddr=10.1.1.253 CPUs=32 Weight=20511899 Feature=rack-0,32CPUs
> ./node.conf:NodeName=compute-0-1 NodeAddr=10.1.1.253 CPUs=32 Weight=20511899 Feature=rack-0,32CPUs
> ./nodenames.conf.new4:NodeName=compute-0-1 NodeAddr=10.1.1.253 CPUs=32 Weight=20511899 Feature=rack-0,32CPUs
> # cat parts
> PartitionName=WHEEL RootOnly=yes Priority=1000 Nodes=ALL
> PartitionName=SEA AllowAccounts=fish Nodes=ALL
> # scontrol show node compute-0-0
> NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=1
>   CPUAlloc=0 CPUTot=32 CPULoad=0.01
>   AvailableFeatures=rack-0,32CPUs
>   ActiveFeatures=rack-0,32CPUs
>   Gres=(null)
>   NodeAddr=10.1.1.254 NodeHostName=compute-0-0
>   OS=Linux 3.10.0-1062.1.2.el7.x86_64 #1 SMP Mon Sep 30 14:19:46 UTC 2019
>   RealMemory=64259 AllocMem=0 FreeMem=63421 Sockets=32 Boards=1
>   State=IDLE ThreadsPerCore=1 TmpDisk=444124 Weight=20511900 Owner=N/A MCS_label=N/A
>   Partitions=CLUSTER,WHEEL,SEA
>   BootTime=2020-04-18T10:30:07 SlurmdStartTime=2020-04-19T22:32:12
>   CfgTRES=cpu=32,mem=64259M,billing=47
>   AllocTRES=
>   CapWatts=n/a
>   CurrentWatts=0 AveWatts=0
>   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> 
> # squeue
>   JOBID PARTITION     NAME USER ST       TIME  NODES NODELIST(REASON)
>     436       SEA  relax13  raz  R   21:44:22      3 compute-0-[1-2],hpc
>     435       SEA 261660mo  abb  R 1-05:19:31      3 compute-0-[1-2],hpc
> 
> compute-0-0 is idle. So why did Slurm decide to put those jobs on the other nodes?
> Any ideas for debugging?
> 
> 
> Regards,
> Mahmood
> 


