[slurm-users] exclusive or not exclusive, that is the question

Marcus Wagner wagner at itc.rwth-aachen.de
Wed Aug 21 05:59:50 UTC 2019

Hi Chris,

It is not my intention to run such a job myself. I am just trying to
reproduce a bad behaviour; my users are submitting such jobs.

The output of job 2 was a bad example, as I saw later that the job was
not yet running. The output changes for a running job. It looks more
like:
    NumNodes=1 NumCPUs=48 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

Let me first give you a little more background.
Complete node definition from our slurm.conf:
NodeName=nrm[001-208] Sockets=4 CoresPerSocket=12 ThreadsPerCore=1 
RealMemory=187200 Feature=skylake,skx8160,hostok,hpcwork Weight=100622 
complete partition definition from our slurm.conf:
PartitionName=c18m PriorityTier=100 Nodes=ncm[0001-1032],nrm[001-208] 
State=UP DefMemPerCPU=3900 TRESBillingWeights="CPU=1.0,Mem=0.25G"

We intend to bill fairly for the resources the user asked for. That is
why we use TRESBillingWeights together with PriorityFlags=MAX_TRES. If
the user uses half of the node's memory and fewer cores, he should be
billed for half of the node. The same goes for nodes with GPUs (not in
the above definition): if a node has two GPUs and the user asks for one,
he should be billed for at least half the node.
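
The intended billing can be sketched as follows. This is a simplified
model, not Slurm's actual code: it assumes that with MAX_TRES the billing
is the maximum of the weighted per-node TRES, that "Mem=0.25G" means 0.25
per GB with 1 GB = 1024 MB, and it skips any rounding Slurm may apply.
The function name and the 4-core example job are hypothetical.

```python
# Simplified model of billing with PriorityFlags=MAX_TRES.
# Assumption: memory weight is per GB, with 1 GB = 1024 MB.

CPU_WEIGHT = 1.0           # TRESBillingWeights "CPU=1.0"
MEM_WEIGHT_PER_GB = 0.25   # TRESBillingWeights "Mem=0.25G"
MB_PER_GB = 1024           # assumed conversion factor


def billing_max_tres(cpus: int, mem_mb: int) -> float:
    """Return the larger of the weighted CPU and weighted memory share."""
    cpu_bill = cpus * CPU_WEIGHT
    mem_bill = (mem_mb / MB_PER_GB) * MEM_WEIGHT_PER_GB
    return max(cpu_bill, mem_bill)


# Hypothetical job: 4 cores and half of the node's RealMemory (187200 MB).
half_node = billing_max_tres(cpus=4, mem_mb=187200 // 2)
print(f"{half_node:.2f}")  # prints 22.85, roughly half of 48

# Full node with the default memory: the CPU share dominates.
full_node = billing_max_tres(cpus=48, mem_mb=48 * 3900)
print(f"{full_node:.2f}")  # prints 48.00
```

With these weights the memory of a full node is worth slightly less than
its 48 cores, so a job that stays within RealMemory should never be
billed more than 48 under this model.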

So much for what we intended to do. But you can already see the problem
in the output of job 2. The billing is 58, which is more than the 48
cores of the node. This is because the user asked for 10G per CPU and
also asked for two tasks. But since the job is exclusive, the job gets
48 CPUs. SLURM now multiplies the number of CPUs by the requested memory
per CPU, so the TRES contains more memory than the node has altogether.
To make it even worse, I increased the memory:
#SBATCH --mem-per-cpu=90000

Output of the job (to be clear: scontrol show jobid):
    NumNodes=1 NumCPUs=48 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
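
Multiplying the 48 allocated CPUs by the requested 90000 MB per CPU gives
the memory total that ends up accounted to the job. A quick check of that
arithmetic, assuming the multiplication described above and 1 TB =
1024 * 1024 MB (variable names are only for illustration):

```python
# Memory accounted to the exclusive job when --mem-per-cpu=90000 is
# multiplied by all 48 CPUs of the node instead of the 2 requested tasks.

CPUS_ALLOCATED = 48        # NumCPUs of the exclusive job
MEM_PER_CPU_MB = 90000     # from "#SBATCH --mem-per-cpu=90000"
REAL_MEMORY_MB = 187200    # RealMemory of the node in slurm.conf

total_mb = CPUS_ALLOCATED * MEM_PER_CPU_MB
total_tb = total_mb / 1024**2          # assumed: 1 TB = 1024 * 1024 MB
ratio = total_mb / REAL_MEMORY_MB      # how often the node fits into that

print(total_mb)            # prints 4320000
print(f"{total_tb:.2f}")   # prints 4.12
print(f"{ratio:.1f}")      # prints 23.1
```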

I definitely know that we have no node with 4 TB of memory. It might be
that I misunderstood TRES altogether, but I thought they reflect the
(theoretical) usage of the job, i.e. what is blocked for other jobs.
The user does not have a chance to use the memory, he is being accounted 

I have now looked into what kind of cgroups are generated for this kind of

Excerpt from the man page (slurm.conf):

>        RealMemory
>               Size of real memory on the node in megabytes (e.g.
>               "2048"). The default value is 1. Lowering RealMemory
>               with the goal of setting aside some amount for the OS
>               and not available for job allocations will not work as
>               intended if Memory is not set as a consumable resource
>               in SelectTypeParameters. So one of the *_Memory options
>               need to be enabled for that goal to be accomplished.
>               Also see MemSpecLimit.

So at most 187200 MB should be available to a job on this partition,
right? So what happens if we omit the --mem-per-cpu option? We get
exactly the number of CPUs times DefMemPerCPU. The billing is therefore
48, as intended.
But what happens if we set, e.g., --mem-per-cpu to 10000 MB?
This is 191905 MB, as seen on the OS by free -m, and is more than the
defined 187200 MB RealMemory.
As we have no swap enabled on the nodes, this means the job could crash
the node.
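
The numbers above can be checked with a few lines. This only uses the
values quoted from our slurm.conf and from free -m; the variable names
are made up for the sketch:

```python
# Compare the default allocation and the OS-visible memory with RealMemory.

CPUS = 48                   # Sockets=4 * CoresPerSocket=12
DEF_MEM_PER_CPU_MB = 3900   # DefMemPerCPU of partition c18m
REAL_MEMORY_MB = 187200     # RealMemory of the node
OS_TOTAL_MB = 191905        # total memory reported by "free -m"

# Without --mem-per-cpu the job gets CPUs * DefMemPerCPU.
default_total_mb = CPUS * DEF_MEM_PER_CPU_MB
print(default_total_mb == REAL_MEMORY_MB)   # prints True

# Memory that RealMemory deliberately leaves to the OS.
os_reserve_mb = OS_TOTAL_MB - REAL_MEMORY_MB
print(os_reserve_mb)                        # prints 4705
```

So the default request matches RealMemory exactly, while a limit of
191905 MB would also hand the job the 4705 MB that were meant to stay
with the OS.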

This at least does not "feel" right.


>      Just made another test.
>      Thank god, the exclusivity is not "destroyed" completely: only one job
>      can run on the node when the job is exclusive. Nonetheless, this is
>      somewhat unintuitive.
>      I wonder if that also has an influence on the cgroups and the process
>      affinity/binding.
>      I will do some more tests.
>      Best
>      Marcus
