[slurm-users] Using free memory available when allocating a node to a job

John Hearns hearnsj at googlemail.com
Tue May 29 04:39:23 MDT 2018


Also regarding memory, there are system tunings you can set for the
behaviour of the Out-Of-Memory (OOM) killer and for VM overcommit.

I have seen the VM overcommit parameters discussed elsewhere, and for HPC
the general advice is to disable overcommit:
https://www.suse.com/support/kb/doc/?id=7002775
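
For reference, the strict-accounting settings such guides usually suggest
look like this; the values here are illustrative, not a recommendation:

```
# /etc/sysctl.d/90-overcommit.conf (illustrative values)
vm.overcommit_memory = 2     # strict accounting: refuse allocations beyond the limit
vm.overcommit_ratio = 100    # commit limit = swap + 100% of RAM
```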
This of course depends very much on your environment and applications.
Could you please say what problems you are having with memory?

On 29 May 2018 at 12:26, John Hearns <hearnsj at googlemail.com> wrote:

> Alexandre, it would be helpful if you could say why this behaviour is
> desirable.
> For instance, do you have codes which need a large amount of memory, and
> are your users seeing those codes crash because other jobs running on the
> same nodes are using memory?
>
> I have two thoughts:
>
> A) Enable exclusive jobs, i.e. run one job per compute node. Then that
> job has all the memory.
> This is a very good way to run HPC in my experience. Yes, I know it is
> inefficient if there are lots of single-core jobs,
> so this depends on what your mix of jobs is.
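
For what it's worth, exclusivity can be set per partition or requested per
job; a sketch, assuming a hypothetical partition name and node range:

```
# slurm.conf: every job on this partition gets whole nodes
PartitionName=compute Nodes=node[01-16] OverSubscribe=EXCLUSIVE

# or per job, at submission time:
#   sbatch --exclusive job.sh
```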
>
> B) Have you considered implementing cgroups? Each job would then be
> allocated memory and CPU cores, and
> jobs would not be able to grow larger than their allocated cgroup limits.
>
> I would really ask you to consider cgroups.
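
A minimal sketch of the relevant configuration, assuming Slurm's
task/cgroup plugin; the choice of parameters is illustrative:

```
# slurm.conf
TaskPlugin=task/cgroup

# cgroup.conf
CgroupAutomount=yes
ConstrainRAMSpace=yes    # cap each job's RAM at its allocation
ConstrainSwapSpace=yes   # cap swap as well, so jobs cannot spill over
```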
>
>
> On 29 May 2018 at 11:34, PULIDO, Alexandre <alexandre.pulido at ariane.group>
> wrote:
>
>> Hi,
>>
>>
>>
>> In the cluster where I'm deploying Slurm, job allocation has to be
>> based on the actual free memory available on the node, not just the
>> memory allocated by Slurm. This is non-negotiable, and I understand it
>> is not how Slurm is designed to work, but I'm trying anyway.
>>
>>
>>
>> Among the solutions that I'm envisaging:
>>
>>
>>
>> 1) Periodically create and update a numerical node feature, encoded as
>> a string with a special character separating the wanted value (e.g.
>> memfree_2048). This definitely seems messy and too hacky to implement,
>> but is there an equivalent to PBS' numerical complexes and sensors in
>> Slurm?
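
For what it's worth, idea 1 could be prototyped with a periodic per-node
script. This is a hypothetical sketch: the memfree_<MB> naming, the
2048 MB bucketing, and the cron-driven refresh are all assumptions, and
overwriting AvailableFeatures clobbers any other features set on the node:

```shell
#!/bin/bash
# Hypothetical sketch: publish a node's free memory as a Slurm feature,
# e.g. "memfree_6144", refreshed periodically on each compute node.
set -eu

node=$(hostname -s)

# MemAvailable is the kernel's estimate of memory usable without swapping
free_mb=$(awk '/^MemAvailable:/ {print int($2/1024)}' /proc/meminfo)

# Round down to 2048 MB buckets to limit feature churn between updates
bucket=$(( free_mb / 2048 * 2048 ))

# Shown as an echo here; a real deployment would run the scontrol command:
#   scontrol update NodeName="$node" AvailableFeatures="memfree_${bucket}"
echo "scontrol update NodeName=$node AvailableFeatures=memfree_${bucket}"
```

Since features match as exact strings, a job cannot express "at least N MB"
with a constraint, and the value goes stale between updates; both are
reasons this approach feels hacky.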
>>
>>
>>
>> 2) Modify the select/cons_res plugin to compare against the actual free
>> memory instead of the allocated memory. Is it as simple as editing the
>> "_add_job_to_res" function (https://github.com/SchedMD/slurm/blob/master/src/plugins/select/cons_res/select_cons_res.c#L816)
>> and using the real remaining memory? I don't want to break anything
>> else, so that's my main question here; please guide me towards a
>> solution, or share other thoughts on its feasibility.
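
One detail worth settling before patching the plugin is what "actual free
memory" should mean: MemFree undercounts, because page cache is
reclaimable, so MemAvailable is usually the better proxy. A quick
illustration of the difference:

```shell
# MemAvailable (Linux >= 3.14) estimates memory usable without swapping;
# MemFree alone undercounts, since page cache and buffers are reclaimable.
mem_free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
mem_avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
echo "MemFree=${mem_free_kb} kB  MemAvailable=${mem_avail_kb} kB"
```

Note also that slurmd already reports a per-node FreeMem figure (visible
in scontrol show node), which may be a less invasive data source than
patching cons_res.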
>>
>>
>>
>> Thanks a lot in advance!
>>
>>
>>
>> Best regards,
>>
>> *Alexandre PULIDO*
>>
>> This email (including any attachments) may contain confidential or
>> proprietary and/or privileged information, or information otherwise
>> protected from disclosure, or may be subject to export control laws and
>> regulations. If you are not the intended recipient, please notify the
>> sender immediately; do not reproduce this message or any attachments,
>> do not use it for any purpose or disclose its content to any person,
>> and delete this message and any attachments from your system.
>> Unauthorized export or re-export is prohibited. ArianeGroup SAS
>> disclaims any and all liability if this email transmission was virus
>> corrupted, altered or falsified. ArianeGroup SAS (519 032 247 RCS
>> PARIS) - Share capital: 265 904 408 EUR - Registered office: Tour
>> Cristal, 7-11 Quai André Citroën, 75015 Paris - VAT FR 82 519 032 247 -
>> APE/NAF 3030Z
>>
>
>