[slurm-users] Slurm 18.08.8 --mem-per-cpu + --exclusive = strange behavior
Beatrice Charton
beatrice.charton at criann.fr
Mon Dec 16 08:49:39 UTC 2019
Hi Marcus and Bjørn-Helge
Thank you for your answers.
We don’t use Slurm billing; we use system accounting for billing.
I also confirm that with --exclusive there is a difference between ReqCPUS and AllocCPUS. Previously, --mem-per-cpu behaved more like a --mem-per-task: it was applied to ReqCPUS. It now looks like it is applied to AllocCPUS.
If this is not a side effect, why are such jobs not rejected outright, instead of being accepted and left Pending forever?
The behaviour is the same in 19.05.2 but corrected in 19.05.3, so the problem seems to be known (and fixed) in v19 but not corrected in v18.
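For anyone who wants to check the same thing, the difference is visible with a plain sacct query along these lines (the job id is of course just a placeholder):
$ sacct -j <jobid> --format=JobID,ReqCPUS,AllocCPUS,ReqMem,State
On 18.08.7 the per-cpu memory request stayed tied to ReqCPUS; on 18.08.8 it appears to follow AllocCPUS.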
Sincerely,
Béatrice
> On 12 Dec 2019, at 12:10, Marcus Wagner <wagner at itc.rwth-aachen.de> wrote:
>
> Hi Beatrice and Bjørn-Helge,
>
> I can confirm that it works with 18.08.7. We additionally use TRESBillingWeights together with PriorityFlags=MAX_TRES. For example:
> TRESBillingWeights="CPU=1.0,Mem=0.1875G,gres/gpu=12.0"
> We use the billing factor for our external accounting, to account node usage fairly. But we do see a similar effect due to --exclusive.
> In Beatrice's case, the billing weight would be:
> TRESBillingWeights="CPU=1.0,Mem=0.21875G"
> So, a 10 cpu job with 1 GB per cpu would be billed 10.
> A 1 cpu job with 10 GB would be billed 2 (0.21875 * 10, rounded down).
> An exclusive 10 cpu job with 1 GB per cpu would be billed 28 (all 28 cores are for the job).
> An exclusive 1 cpu job with 30 GB (Beatrice's example) would be billed 28 (cores) * 30 (GB) * 0.21875 => 118.125 => 118 cores.
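> (Just as context, and only as a sketch with placeholder partition and node names: such weights are set per partition in slurm.conf, with MAX_TRES enabled globally, e.g.
> PriorityFlags=MAX_TRES
> PartitionName=compute Nodes=node[001-100] TRESBillingWeights="CPU=1.0,Mem=0.21875G"
> With MAX_TRES, a job's billing is driven by whichever weighted TRES is largest, which is why memory-heavy jobs are billed more than their cpu count alone would suggest.)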
>
> Best
> Marcus
>
> On 12/12/19 9:47 AM, Bjørn-Helge Mevik wrote:
>> Beatrice Charton <beatrice.charton at criann.fr> writes:
>>
>>> Hi,
>>>
>>> We have a strange behaviour of Slurm after updating from 18.08.7 to
>>> 18.08.8, for jobs using --exclusive and --mem-per-cpu.
>>>
>>> Our nodes have 128GB of memory, 28 cores.
>>> $ srun --mem-per-cpu=30000 -n 1 --exclusive hostname
>>> => works in 18.08.7
>>> => doesn’t work in 18.08.8
>> I'm actually surprised it _worked_ in 18.08.7. At some point, long before
>> v18.08, the behaviour of --exclusive was changed: in order to
>> account the job for all cpus on the node, the number of
>> cpus asked for with --ntasks would simply be multiplied by
>> "#cpus-on-node / --ntasks" (so in your case: 28). Unfortunately, that
>> also means that the memory the job requires per node is "#cpus-on-node /
>> --ntasks" multiplied by --mem-per-cpu (in your case 28 * 30000 MiB ~=
>> 820 GiB). For this reason, we tend to ban --exclusive on our clusters
>> (or at least warn about it).
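>> (One way to avoid that multiplication, untested on 18.08.8, would be to request
>> memory per node instead of per cpu, e.g.
>> $ srun --mem=30000 -n 1 --exclusive hostname
>> since --mem is a per-node limit and is not scaled by the cpu count.)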
>>
>> I haven't looked at the code for a long time, so I don't know whether
>> this is still the current behaviour, but every time I've tested, I've
>> seen the same problem. I believe I've tested on 19.05 (but I might
>> remember wrong).
>>
>
> --
> Marcus Wagner, Dipl.-Inf.
>
> IT Center
> Abteilung: Systeme und Betrieb
> RWTH Aachen University
> Seffenter Weg 23
> 52074 Aachen
> Tel: +49 241 80-24383
> Fax: +49 241 80-624383
> wagner at itc.rwth-aachen.de
> www.itc.rwth-aachen.de
>
>
--
Béatrice CHARTON | CRIANN
Beatrice.Charton at criann.fr | 745, avenue de l'Université
Tel : +33 (0)2 32 91 42 91 | 76800 Saint Etienne du Rouvray
--- Support : support at criann.fr ---