[slurm-users] [EXT] wrong number of jobs used
Adrian Sevcenco
Adrian.Sevcenco at spacescience.ro
Tue Jan 19 22:37:13 UTC 2021
On 1/19/21 11:25 PM, Sean Crosby wrote:
> Hi Adrian,
Hi!
> From this output
>
> AVAIL NODES(A/I/O/T) CPUS(A/I/O/T) DEFAULTTIME TIMELIMIT
> up 23/0/0/23 837/587/0/1424 1-00:00:00 2-00:00:00
>
> It shows that all 23 nodes have at least one job running on them.
>
> So what happens if you run scontrol show node on a few of the nodes? I'm
> particularly interested in the AllocTRES section
>
> e.g. for one of my nodes,
>
> # scontrol show node spartan-bm055 | grep 'NodeName\|CfgTRES\|AllocTRES'
> NodeName=spartan-bm055 Arch=x86_64 CoresPerSocket=18
> CfgTRES=cpu=72,mem=1519000M,billing=6005
> AllocTRES=cpu=72,mem=441840M
>
> It shows that for this node, it has 72 cores and 1.5TB RAM (the CfgTRES
> part), and currently jobs are using 72 cores, and 442GB RAM.
>
> I would run the same command on 4 or 5 of the nodes on your cluster, and
> we'll have a better idea about what's going on.
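>
> Something like this would check several at once (the node names here are
> placeholders for yours):
>
> # for n in node0{1..5}; do scontrol show node $n | grep 'NodeName\|CfgTRES\|AllocTRES'; done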
First of all, thanks for answering and for the tip (I had not thought to
look there).
But it turned out that my fs.file-max, which had been fine so far at
65500 on the nodes with 48 slots, was no longer enough for the new
single-socket nodes with 128 slots :)
This killed the service that serves the actual files/software for the
jobs, so those nodes were a little bit zombie-like: the already-present
jobs were in memory and actually running, but no new jobs could be
started on those nodes.
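
For reference, checking and raising the limit looks like this (the new
value is only an example; size it to the per-node slot count):

# sysctl fs.file-max
# cat /proc/sys/fs/file-nr
# echo 'fs.file-max = 500000' > /etc/sysctl.d/90-file-max.conf
# sysctl -p /etc/sysctl.d/90-file-max.conf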
Thanks and again sorry for the noise!
Adrian
>
> Sean
>
> --
> Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
>
> On Wed, 20 Jan 2021 at 06:50, Adrian Sevcenco
> <Adrian.Sevcenco at spacescience.ro
> <mailto:Adrian.Sevcenco at spacescience.ro>> wrote:
>
> Hi! So, I have a very strange situation that I do not even know how to
> troubleshoot...
> I'm running with
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU_Memory,CR_LLN
> TaskPlugin=task/affinity,task/cgroup
> TaskPluginParam=autobind=threads
>
> and a partition defined with:
> LLN=yes DefMemPerCPU=4000 MaxMemPerCPU=4040
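>
> (for context, the full line in slurm.conf is something along these
> lines, with the partition and node names being placeholders:
> PartitionName=hep Nodes=wn[01-23] LLN=yes DefMemPerCPU=4000
> MaxMemPerCPU=4040 DefaultTime=1-00:00:00 MaxTime=2-00:00:00 State=UP)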
>
> PriorityType=priority/basic
> SchedulerType=sched/builtin
>
> This is a HEP cluster, so only serial single thread jobs.
>
> (physically all nodes have 4 GB/thread)
> the nodes are defined (now, only after a lot of experimentation and the
> realization that detailed node properties can be, and in my case were,
> incompatible with CR_CPU) with just CPUs and RealMemory (obtained from
> slurmd -C on each node)
>
> and with FastSchedule=0
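>
> (i.e. node lines along these lines; the names and memory values are
> illustrative, the real ones come from slurmd -C:
> NodeName=wn[01-12] CPUs=48 RealMemory=192000
> NodeName=wn[13-23] CPUs=128 RealMemory=512000)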
>
> the problem is that the partition is stuck at a low number of allocated
> CPUs (around 834 out of 1424)
>
> AVAIL NODES(A/I/O/T) CPUS(A/I/O/T) DEFAULTTIME TIMELIMIT
> up 23/0/0/23 837/587/0/1424 1-00:00:00 2-00:00:00
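>
> (that view comes from an sinfo format string along the lines of:
> sinfo -o '%a %F %C %L %l')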
>
>
> I set up SlurmctldDebug=debug and
> DebugFlags=Priority,SelectType,NodeFeatures,CPU_Bind,NO_CONF_HASH
>
> but I am not able to recognize anything as a problem.
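>
> (as an aside, both can be changed at runtime without restarting
> slurmctld:
> scontrol setdebug debug
> scontrol setdebugflags +SelectType)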
>
> Does anyone have any idea why not all my slots would be used?
>
> Thank you!!
> Adrian
>
>
--
----------------------------------------------
Adrian Sevcenco, Ph.D. |
Institute of Space Science - ISS, Romania |
adrian.sevcenco at {cern.ch,spacescience.ro} |
----------------------------------------------