[slurm-users] [EXT] wrong number of jobs used

Sean Crosby scrosby at unimelb.edu.au
Tue Jan 19 21:25:11 UTC 2021

Hi Adrian,

From this output

up      23/0/0/23     837/587/0/1424   1-00:00:00   2-00:00:00

It shows that all 23 nodes have at least one job running on them.
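
As a rough illustration of how that line can be read, here is a minimal sketch. It assumes the second and third columns are sinfo's allocated/idle/other/total counts for nodes and CPUs respectively; the column positions and the helper name parse_aiot are assumptions, not something stated in the thread.

```python
def parse_aiot(field):
    """Split an 'allocated/idle/other/total' field into a dict of ints."""
    allocated, idle, other, total = (int(x) for x in field.split("/"))
    return {"allocated": allocated, "idle": idle, "other": other, "total": total}


# The partition line quoted in this thread.
line = "up      23/0/0/23     837/587/0/1424   1-00:00:00   2-00:00:00"

fields = line.split()
# Assumption: column 2 is the node counts, column 3 is the CPU counts.
nodes = parse_aiot(fields[1])
cpus = parse_aiot(fields[2])

# Share of the partition's CPUs that are currently allocated.
cpu_fraction = cpus["allocated"] / cpus["total"]

print(f"nodes with no jobs: {nodes['idle']} of {nodes['total']}")
print(f"CPUs allocated: {cpus['allocated']} of {cpus['total']} ({cpu_fraction:.0%})")
```

Read this way, no node is fully idle, yet 587 CPUs are unallocated, so the free CPUs must be spread across nodes that already run jobs.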

So what happens if you run scontrol show node on a few of the nodes? I'm
particularly interested in the AllocTRES section

e.g. for one of my nodes,

# scontrol show node spartan-bm055 | grep 'NodeName\|CfgTRES\|AllocTRES'
NodeName=spartan-bm055 Arch=x86_64 CoresPerSocket=18

It shows that this node has 72 cores and 1.5TB RAM (the CfgTRES part), and
that jobs are currently using 72 cores and 442GB RAM.

I would run the same command on 4 or 5 of the nodes on your cluster, and
we'll have a better idea about what's going on.
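
One possible way to compare the two TRES lines across nodes is sketched below. The sample text is hypothetical (the node name node001 and every number in it are made up for illustration, not taken from any cluster in this thread), and the 4000 MB per job figure is the DefMemPerCPU value from the quoted configuration. The helper names are also assumptions. In practice the text would come from running scontrol show node on each node.

```python
import re

# Multipliers to convert a Slurm memory suffix to megabytes.
UNITS_MB = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}


def parse_tres(text, key):
    """Return the 'key=cpu=..,mem=..' list found in text as a dict of strings."""
    match = re.search(rf"\b{key}=(\S*)", text)
    if not match or not match.group(1):
        return {}  # e.g. an empty 'AllocTRES=' on a node with no jobs
    return dict(item.split("=", 1) for item in match.group(1).split(","))


def mem_to_mb(value):
    """Convert a value such as '442000M' or '1.5T' to megabytes."""
    if value[-1].isdigit():
        return float(value)  # assumption: a bare number is already in MB
    return float(value[:-1]) * UNITS_MB[value[-1].upper()]


def headroom(node_text, mem_per_job_mb=4000):
    """Report idle CPUs, free memory, and how many more jobs fit by memory."""
    cfg = parse_tres(node_text, "CfgTRES")
    alloc = parse_tres(node_text, "AllocTRES")
    idle_cpus = int(cfg["cpu"]) - int(alloc.get("cpu", 0))
    free_mem_mb = mem_to_mb(cfg["mem"]) - mem_to_mb(alloc.get("mem", "0"))
    return {
        "idle_cpus": idle_cpus,
        "free_mem_mb": free_mem_mb,
        "jobs_by_mem": int(free_mem_mb // mem_per_job_mb),
    }


# Hypothetical sample; all values are invented for illustration.
SAMPLE = """NodeName=node001 Arch=x86_64 CoresPerSocket=16
   CfgTRES=cpu=64,mem=250000M,billing=64
   AllocTRES=cpu=38,mem=152000M
"""

print(headroom(SAMPLE))
```

If jobs_by_mem comes out lower than idle_cpus on a node, memory rather than CPU count would be what limits further jobs there; if both are comfortably above zero, the cause is more likely elsewhere.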


Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia

On Wed, 20 Jan 2021 at 06:50, Adrian Sevcenco <
Adrian.Sevcenco at spacescience.ro> wrote:

> Hi! So, I have a very strange situation that I do not even know how to
> troubleshoot...
> I'm running with
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU_Memory,CR_LLN
> TaskPlugin=task/affinity,task/cgroup
> TaskPluginParam=autobind=threads
> and a partition defined with:
> LLN=yes DefMemPerCPU=4000 MaxMemPerCPU=4040
> PriorityType=priority/basic
> SchedulerType=sched/builtin
> This is a HEP cluster, so only serial single-threaded jobs.
> (physically all nodes have 4 GB/thread)
> the nodes are now defined (only after a lot of experimentation and the
> realization that the node properties could be, and were, incompatible with
> CR_CPU) with just CPUs and RealMemory (obtained from slurmd -C
> on each node)
> and with FastSchedule=0
> the problem is that the partition is stuck at a low number of used CPUs
> (around 834 of 1424)
> up      23/0/0/23     837/587/0/1424   1-00:00:00   2-00:00:00
> I set up SlurmctldDebug=debug and
> DebugFlags=Priority,SelectType,NodeFeatures,CPU_Bind,NO_CONF_HASH
> but I am not able to recognize anything as a problem.
> Does anyone have any idea why not all my slots would be used?
> Thank you!!
> Adrian