[slurm-users] srun --mem issue

René Sitt sittr at hrz.uni-marburg.de
Thu Dec 8 10:39:21 UTC 2022


Hi,

same here - since RealMemory will almost always be < Free Memory,
setting --mem=0 will get the job rejected. The downside is that we have
to sensitize our users to request a little less than the 'theoretical
maximum' of the nodes - I have some heuristics in job_submit.lua that
output hints at submit time when a job's memory request is very near or
slightly over a node type's maximum free memory.
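
As a rough sketch (the memory values and the 95% threshold are made up
for illustration, and the field names follow the job_submit.lua snippet
quoted further down), such a hint heuristic could look like this:

   -- Hypothetical usable memory per node type, in MiB - adjust per site.
   local node_mem = {
      standard = 191000,
      bigmem   = 1031000,
   }

   function slurm_job_submit(job_desc, part_list, submit_uid)
      local req = job_desc.min_mem_per_node
      -- Depending on the Slurm version, an unset --mem may show up as nil
      -- or as a very large NO_VAL sentinel rather than 0.
      if req ~= nil and req > 0 then
         for ntype, maxmem in pairs(node_mem) do
            -- Hint if the request is within ~5% of, or above, the maximum.
            if req > maxmem * 0.95 then
               slurm.log_user(string.format(
                  "Hint: --mem=%dM is close to or above the usable memory "
                  .. "of %s nodes (%d MiB); consider requesting less.",
                  req, ntype, maxmem))
            end
         end
      end
      return slurm.SUCCESS
   end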

Kind regards,
René Sitt

On 08.12.22 at 11:28, Loris Bennett wrote:
> Hi Moshe,
>
> Moshe Mergy <moshe.mergy at weizmann.ac.il> writes:
>
>> Hi Loris
>>
>> indeed, https://slurm.schedmd.com/resource_limits.html explains the possible limitations
>>
>> At present, I do not limit memory for specific users, but just set a global limit in slurm.conf:
>>
>>    MaxMemPerNode=65536 (for a 64 GB limit)
>>
>> But... anyway, with my Slurm version 20.02, any user can obtain MORE than 64 GB of memory by using the "--mem=0" option!
>>
>> So I had to filter this in job_submit.lua.
> We don't use MaxMemPerNode but define RealMemory for groups of nodes
> which have the same amount of RAM.  We share the nodes and use
>
>    SelectType=select/cons_res
>    SelectTypeParameters=CR_Core_Memory
>
> So a job can't start on a node if it requests more memory than
> available, i.e. more than RealMemory minus memory already committed to
> other jobs, even if --mem=0 is specified (I guess).
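>
> For illustration, such per-node-group definitions look roughly like this
> in slurm.conf (node names, CPU counts, and memory sizes made up;
> RealMemory is in MiB):
>
>    NodeName=node[001-020] CPUs=32 RealMemory=191000
>    NodeName=bigmem[01-04] CPUs=32 RealMemory=1031000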
>
> Cheers,
>
> Loris
>
>> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Loris Bennett <loris.bennett at fu-berlin.de>
>> Sent: Thursday, December 8, 2022 10:57:56 AM
>> To: Slurm User Community List
>> Subject: Re: [slurm-users] srun --mem issue
>>   
>> Loris Bennett <loris.bennett at fu-berlin.de> writes:
>>
>>> Moshe Mergy <moshe.mergy at weizmann.ac.il> writes:
>>>
>>>> Hi Sandor
>>>>
>>>> I personally block "--mem=0" requests in job_submit.lua (Slurm 20.02):
>>>>
>>>>    if (job_desc.min_mem_per_node == 0 or job_desc.min_mem_per_cpu == 0) then
>>>>       slurm.log_info("%s: ERROR: unlimited memory requested", log_prefix)
>>>>       slurm.log_info("%s: ERROR: job %s from user %s rejected because of an invalid (unlimited) memory request.", log_prefix, job_desc.name, job_desc.user_name)
>>>>       slurm.log_user("Job rejected because of an invalid memory request.")
>>>>       return slurm.ERROR
>>>>    end
>>> What happens if somebody explicitly requests all the memory, so in
>>> Sandor's case --mem=500G?
>>>
>>>> Maybe there is a better or nicer solution...
>> Can't you just use account and QOS limits:
>>
>>    https://slurm.schedmd.com/resource_limits.html
>>
>> ?
>>
>> And anyway, what is the use-case for preventing someone from using all the
>> memory? In our case, if someone really needs all the memory, they should be
>> able to have it.
>>
>> However, I do have a chronic problem with users requesting too much
>> memory. My approach has been to try to get people to use 'seff' to see
>> what resources their jobs in fact need.  In addition each month we
>> generate a graphical summary of 'seff' data for each user, like the one
>> shown here
>>
>>    https://www.fu-berlin.de/en/sites/high-performance-computing/Dokumentation/Statistik
>>
>> and automatically send an email to those with a large percentage of
>> resource-inefficient jobs telling them to look at their graphs and
>> correct their resource requirements for future jobs.
>>
>> Cheers,
>>
>> Loris
>>
>>>> All the best
>>>> Moshe
>>>>
>>>>
>> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>> From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Felho, Sandor <Sandor.Felho at transunion.com>
>>>> Sent: Wednesday, December 7, 2022 7:03 PM
>>>> To: slurm-users at lists.schedmd.com
>>>> Subject: [slurm-users] srun --mem issue
>>>>   
>>>> TransUnion is running a ten-node site using Slurm with multiple queues. We have an issue with the --mem parameter. There is one user who has read the Slurm manual and found
>>>> --mem=0. This gives the user the maximum memory on the node (500 GiB) for a single job. How can I block a --mem=0 request?
>>>>
>>>> We are running:
>>>>
>>>> * OS: RHEL 7
>>>> * cgroups version 1
>>>> * slurm: 19.05
>>>>
>>>> Thank you,
>>>>
>>>> Sandor Felho
>>>>
>>>> Sr Consultant, Data Science & Analytics
>>>>
-- 
Dipl.-Chem. René Sitt
Hessisches Kompetenzzentrum für Hochleistungsrechnen
Philipps-Universität Marburg
Hans-Meerwein-Straße
35032 Marburg

Tel. +49 6421 28 23523
sittr at hrz.uni-marburg.de
www.hkhlr.de
