[slurm-users] exclusive or not exclusive, that is the question

Wed Aug 21 05:59:50 UTC 2019

Hi Chris,

It is not my intention to run such a job myself. I'm just trying to
reproduce a bad behaviour; my users are submitting jobs like this.

The output of job 2 was a bad example: as I saw later, the job was not
yet running at that point. The output changes once the job is running.
It then looks more like:
    NumNodes=1 NumCPUs=48 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
    TRES=cpu=48,mem=240000M,node=1,billing=58

Let me first give you a little more background.
Complete node definition from our slurm.conf:
NodeName=nrm[001-208] Sockets=4 CoresPerSocket=12 ThreadsPerCore=1 
RealMemory=187200 Feature=skylake,skx8160,hostok,hpcwork Weight=100622 
State=UNKNOWN
Complete partition definition from our slurm.conf:
PartitionName=c18m PriorityTier=100 Nodes=ncm[0001-1032],nrm[001-208] 
State=UP DefMemPerCPU=3900 TRESBillingWeights="CPU=1.0,Mem=0.25G"

We intend to bill fairly for the resources the user asked for. That is
why we use TRESBillingWeights together with PriorityFlags=MAX_TRES. If
the user uses half of the node's memory but fewer cores, he should be
billed for half of the node. The same goes for nodes with GPUs (not in
the above definition): if a node has two GPUs and the user asks for
one, he should be billed for at least half the node.

So much for what we intended to do. But you can already see the problem
in the output of job 2. The billing is 58, which is more than the 48
cores of the node. This is because the user asked for 5000 MB per CPU
(10 GB in total for the two tasks), but since the job is exclusive, it
gets all 48 CPUs. Slurm now multiplies the number of CPUs by the
requested memory per CPU, so the TRES contains more memory than the
node has altogether.
To make it even worse, I increased the memory:
#SBATCH --mem-per-cpu=90000

Output of the job (to be clear: scontrol show jobid):
    NumNodes=1 NumCPUs=48 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
    TRES=cpu=48,mem=4320000M,node=1,billing=1054
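
The billing values in both outputs can be reproduced with a small Python
sketch. The formula is an assumption inferred from the numbers in this
thread, not taken from the Slurm source: with MAX_TRES the billing is the
largest weighted TRES, the "G" suffix means per 1024 MB, and the result
is truncated to an integer. The helper name billing_max_tres is
hypothetical.

```python
# Sketch of the billing arithmetic. The formula is an assumption inferred
# from the numbers in this thread, not taken from the Slurm source.

MB_PER_GB = 1024  # assumption: "Mem=0.25G" means 0.25 per 1024 MB


def billing_max_tres(cpus, mem_mb, cpu_weight=1.0, mem_weight_per_gb=0.25):
    """Hypothetical helper: largest weighted TRES, truncated to an integer."""
    cpu_part = cpus * cpu_weight
    mem_part = mem_mb / MB_PER_GB * mem_weight_per_gb
    return int(max(cpu_part, mem_part))


NODE_CPUS = 48           # Sockets=4 * CoresPerSocket=12
REAL_MEMORY_MB = 187200  # RealMemory from the node definition

# DefMemPerCPU, the job 2 request, and the increased request
for mem_per_cpu in (3900, 5000, 90000):
    mem_mb = NODE_CPUS * mem_per_cpu  # exclusive job: all 48 CPUs are counted
    billing = billing_max_tres(NODE_CPUS, mem_mb)
    exceeds_node = mem_mb > REAL_MEMORY_MB
    print(mem_per_cpu, mem_mb, billing, exceeds_node)
```

Under these assumptions the three cases give mem=187200M with billing 48,
mem=240000M with billing 58, and mem=4320000M with billing 1054. The last
two memory values are larger than RealMemory.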

I definitely know we have no node with 4 TB of memory. It might be that
I have misunderstood TRES altogether, but I thought they reflect the
(theoretical) usage of the job, i.e. what is blocked for other jobs.
The user has no chance of using the memory he is being billed for.

I have now looked into what kind of cgroups are generated for this kind
of job.

Excerpt from the manpage (slurm.conf):

>        RealMemory
>               Size of real memory on the node in megabytes (e.g.
>               "2048"). The default value is 1. Lowering RealMemory with
>               the goal of setting aside some amount for the OS and not
>               available for job allocations will not work as intended
>               if Memory is not set as a consumable resource in
>               SelectTypeParameters. So one of the *_Memory options need
>               to be enabled for that goal to be accomplished. Also see
>               MemSpecLimit.

So at most 187200 MB should be available to a job on this partition,
right? So what happens if we omit the --mem-per-cpu option? We get
exactly the number of CPUs times DefMemPerCPU (48 * 3900 MB =
187200 MB). The billing is therefore 48, as intended.
But what happens if we set, e.g., --mem-per-cpu to 10000 MB?
/sys/fs/cgroup/memory/slurm/uid_40574/job_7195054/memory.limit_in_bytes: 
201226977280
That is 191905 MB, the same as the OS reports with free -m, and more
than the defined RealMemory of 187200 MB.
As we have no swap enabled on the nodes, this means the job could crash
the node.
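
The conversion can be checked with a few lines of Python. All values are
copied from this mail; the only assumption is that 1 MB here means
1024 * 1024 bytes, as free -m uses.

```python
# Values copied from this mail.
LIMIT_BYTES = 201226977280  # memory.limit_in_bytes of job 7195054
REAL_MEMORY_MB = 187200     # RealMemory in slurm.conf
NODE_CPUS = 48
MEM_PER_CPU_MB = 10000      # --mem-per-cpu used for this test

# Assumption: 1 MB = 1024 * 1024 bytes, as reported by free -m.
limit_mb = LIMIT_BYTES // (1024 * 1024)
requested_mb = NODE_CPUS * MEM_PER_CPU_MB  # memory the exclusive job is counted for
over_real_mb = limit_mb - REAL_MEMORY_MB   # how far the cgroup limit exceeds RealMemory

print(limit_mb)      # 191905
print(requested_mb)  # 480000
print(over_real_mb)  # 4705
```

So the cgroup limit is 4705 MB above RealMemory, while the job is counted
for 480000 MB.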

This at least does not "feel" right.

Best
Marcus

On 8/20/19 4:58 PM, Christopher Benjamin Coffey wrote:
> Hi Marcus,
>
> What is the reason to add "--mem-per-cpu" when the job already has exclusive access to the node? Your job has access to all of the memory, and all of the cores on the system already. Also note, for non-mpi code like single core job, or shared memory threaded job, you want to ask for number of cpus with --cpus-per-task, or -c. Unless you are running mpi code, where you will want to use -n, and --ntasks instead to launch n copies of the code on n cores. In this case, because you asked for -n2, and also specified a mem-per-cpu request, the scheduler is doling out the memory as requested (2 x tasks), likely due to having SelectTypeParameters=CR_Core_Memory in slurm.conf.
>
> Best,
> Chris
>
> —
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>   
>
> On 8/20/19, 1:37 AM, "slurm-users on behalf of Marcus Wagner" <slurm-users-bounces at lists.schedmd.com on behalf of wagner at itc.rwth-aachen.de> wrote:
>
>      Just made another test.
>      
>      
>      Thank god, the exclusivity is not "destroyed" completely: only one
>      job can run on the node when the job is exclusive. Nonetheless,
>      this is somewhat unintuitive.
>      I wonder, if that also has an influence on the cgroups and the process
>      affinity/binding.
>      
>      I will do some more tests.
>      
>      
>      Best
>      Marcus
>      
>      On 8/20/19 9:47 AM, Marcus Wagner wrote:
>      > Hi Folks,
>      >
>      >
>      > I think I've stumbled over a BUG in Slurm regarding
>      > exclusiveness. It might also be that I've misinterpreted something.
>      > In that case I would be happy if someone could explain it to me.
>      >
>      > Some background: I have set PriorityFlags=MAX_TRES
>      > The TRESBillingWeights are "CPU=1.0,Mem=0.1875G" for a partition with
>      > 48 core nodes and RealMemory 187200.
>      >
>      > ---
>      >
>      > I have two jobs:
>      >
>      > job 1:
>      > #SBATCH --exclusive
>      > #SBATCH --ntasks=2
>      > #SBATCH --nodes=1
>      >
>      > scontrol show <jobid> =>
>      >    NumNodes=1 NumCPUs=48 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>      >    TRES=cpu=48,mem=187200M,node=1,billing=48
>      >
>      > Exactly what I expected: I got 48 CPUs and therefore the billing is 48.
>      >
>      > ---
>      >
>      > job 2 (just added mem-per-cpu):
>      > #SBATCH --exclusive
>      > #SBATCH --ntasks=2
>      > #SBATCH --nodes=1
>      > #SBATCH --mem-per-cpu=5000
>      >
>      > scontrol show <jobid> =>
>      >    NumNodes=1-1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>      >    TRES=cpu=2,mem=10000M,node=1,billing=2
>      >
>      > Why does '--mem-per-cpu' "destroy" exclusivity?
>      >
>      >
>      >
>      > Best
>      > Marcus
>      >
>      
>      --
>      Marcus Wagner, Dipl.-Inf.
>      
>      IT Center
>      Abteilung: Systeme und Betrieb
>      RWTH Aachen University
>      Seffenter Weg 23
>      52074 Aachen
>      Tel: +49 241 80-24383
>      Fax: +49 241 80-624383
>      wagner at itc.rwth-aachen.de
>      www.itc.rwth-aachen.de
>      
>      
>      
>

-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de