[slurm-users] Slurm 18.08.8 --mem-per-cpu + --exclusive = strange behavior

Marcus Wagner wagner at itc.rwth-aachen.de
Thu Dec 12 11:10:20 UTC 2019


Hi Beatrice and Bjørn-Helge,

I can confirm that it works with 18.08.7. We additionally use 
TRESBillingWeights together with PriorityFlags=MAX_TRES. For example:
TRESBillingWeights="CPU=1.0,Mem=0.1875G,gres/gpu=12.0"
We use the billing factor for our external accounting, so that the nodes 
are accounted for fairly. But we see a similar effect due to 
--exclusive.
In Beatrice's case, the billing weights would be:
TRESBillingWeights="CPU=1.0,Mem=0.21875G"
So, a 10 CPU job with 1 GB per CPU would be billed 10.
A 1 CPU job with 10 GB would be billed 2 (0.21875 * 10 = 2.1875, floored).
An exclusive 10 CPU job with 1 GB per CPU would be billed 28 (all 28 
cores are reserved for the job).
An exclusive 1 CPU job with 30 GB per CPU (Beatrice's example) would be 
billed 28 (cores) * 30 (GB) * 0.21875 => 183.75 => 183 cores.
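
For what it's worth, this is roughly how I picture the calculation. A 
minimal sketch in Python, not the actual Slurm code; that MAX_TRES takes 
the maximum over the weighted TRES values and that the result is floored 
are my assumptions from the numbers above:

import math

# Hypothetical weights for Beatrice's nodes: 28 cores, 128 GB memory,
# so Mem = 28/128 = 0.21875 per GB ("CPU=1.0,Mem=0.21875G").
WEIGHTS = {"cpu": 1.0, "mem_gb": 0.21875}

def billing(cpus, mem_gb):
    # Maximum over the weighted TRES values, truncated to an integer.
    return math.floor(max(cpus * WEIGHTS["cpu"], mem_gb * WEIGHTS["mem_gb"]))

print(billing(10, 10))       # 10 CPUs, 1 GB/CPU              -> 10
print(billing(1, 10))        # 1 CPU, 10 GB                   -> 2
print(billing(28, 28))       # exclusive, 10 CPUs @ 1 GB/CPU  -> 28
print(billing(28, 28 * 30))  # exclusive, 1 CPU @ 30 GB/CPU   -> 183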

Best
Marcus

On 12/12/19 9:47 AM, Bjørn-Helge Mevik wrote:
> Beatrice Charton <beatrice.charton at criann.fr> writes:
>
>> Hi,
>>
>> We have a strange behaviour of Slurm after updating from 18.08.7 to
>> 18.08.8, for jobs using --exclusive and --mem-per-cpu.
>>
>> Our nodes have 128GB of memory, 28 cores.
>> 	$ srun  --mem-per-cpu=30000 -n 1  --exclusive  hostname
>> => works in 18.08.7
>> => doesn’t work in 18.08.8
> I'm actually surprised it _worked_ in 18.08.7.  At one time - long before
> v 18.08 - the behaviour was changed when using --exclusive: In order to
> account the job for all cpus on the node, the number of
> cpus asked for with --ntasks would simply be multiplied by
> "#cpus-on-node / --ntasks" (so in your case: 28).  Unfortunately, that
> also means that the memory the job requires per node is "#cpus-on-node /
> --ntasks" multiplied by --mem-per-cpu (in your case 28 * 30000 MiB ~=
> 820 GiB).  For this reason, we tend to ban --exclusive on our clusters
> (or at least warn about it).
>
> I haven't looked at the code for a long time, so I don't know whether
> this is still the current behaviour, but every time I've tested, I've
> seen the same problem.  I believe I've tested on 19.05 (but I might
> remember wrong).
>
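
Just to put Bjørn-Helge's point into numbers, here is a small sketch 
(under my assumption that with --exclusive the per-node memory request 
simply becomes #cpus-on-node * --mem-per-cpu):

CPUS_PER_NODE = 28
NODE_MEM_MIB = 128 * 1024  # 128 GB nodes

def exclusive_mem_request(mem_per_cpu_mib):
    # Memory the job ends up requesting per node under --exclusive.
    return CPUS_PER_NODE * mem_per_cpu_mib

req = exclusive_mem_request(30000)  # srun --mem-per-cpu=30000 -n 1 --exclusive
print(req, "MiB requested vs", NODE_MEM_MIB, "MiB on the node")
print("fits" if req <= NODE_MEM_MIB else "does not fit")  # -> does not fit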

-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de



