[slurm-users] Jobs can grow in RAM usage surpassing MaxMemPerNode

Cristóbal Navarro cristobal.navarro.g at gmail.com
Fri Jan 13 13:43:19 UTC 2023


Many thanks, Rodrigo and Daniel.
Indeed I misunderstood that part of Slurm, so thanks for clarifying this
aspect; now it makes a lot of sense.
Regarding the approach, I went with cgroup.conf as suggested by both of you.
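Something along these lines, I think (a minimal sketch, not yet tested on our
side; slurm.conf already has TaskPlugin=task/cgroup and
ProctrackType=proctrack/cgroup, so only cgroup.conf needed to change):

# cgroup.conf
ConstrainCores=yes       # pin steps to their allocated cores
ConstrainRAMSpace=yes    # enforce the requested RAM via the memory cgroup
ConstrainSwapSpace=yes   # also cap swap usage
AllowedSwapSpace=0       # no extra swap on top of the allocation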
I will start doing some synthetic tests to make sure a job gets killed once
it surpasses its requested memory.
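A quick way to trigger that on purpose (just an illustration; it assumes
python3 is available on the node and deliberately allocates about 4 GiB
against a 2 GB request):

srun -p cpu --mem=2G python3 -c "x = bytearray(4 * 1024**3)"

With ConstrainRAMSpace=yes the step should be killed by the kernel's cgroup
OOM handling almost immediately, rather than only after the next accounting
poll.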
Many thanks again.


On Fri, Jan 13, 2023 at 3:49 AM Daniel Letai <dani at letai.org.il> wrote:

> Hello Cristóbal,
>
>
> I think you might have a slight misunderstanding of how Slurm works, which
> would explain the difference between what you expected and what happened.
>
>
> MaxMemPerNode is there to let the scheduler plan job placement according to
> resources. It does not enforce limits during job execution; it only governs
> placement, under the assumption that the job will not use more than the
> resources it requested.
>
>
> One option to limit the job during execution is through cgroups; another
> might be JobAcctGatherParams=OverMemoryKill, but I suspect cgroups would
> indeed be the better option for your use case. From the slurm.conf man page
> on OverMemoryKill:
>
>
> Kill processes that are being detected to use more memory than requested
> by steps every time accounting information is gathered by the JobAcctGather
> plugin. This parameter should be used with caution because a job exceeding
> its memory allocation may affect other processes and/or machine health.
>
> NOTE: If available, it is recommended to limit memory by enabling
> task/cgroup as a TaskPlugin and making use of ConstrainRAMSpace=yes in the
> cgroup.conf instead of using this JobAcctGather mechanism for memory
> enforcement. Using JobAcctGather is polling based and there is a delay
> before a job is killed, which could lead to system Out of Memory events.
>
> NOTE: When using OverMemoryKill, if the combined memory used by all the
> processes in a step exceeds the memory limit, the entire step will be
> killed/cancelled by the JobAcctGather plugin. This differs from the
> behavior when using ConstrainRAMSpace, where processes in the step will
> be killed, but the step will be left active, possibly with other processes
> left running.
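>
> For completeness, the two variants boil down to one line each (parameter
> names as in the man pages; everything else in your config can stay as is):
>
> # slurm.conf (polling based, not recommended here)
> JobAcctGatherParams=OverMemoryKill
>
> # cgroup.conf (preferred; pairs with the TaskPlugin=task/cgroup you already have)
> ConstrainRAMSpace=yes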
>
>
>
> On 12/01/2023 03:47:53, Cristóbal Navarro wrote:
>
> Hi Slurm community,
> Recently we found a small problem triggered by one of our jobs. We have
> MaxMemPerNode=532000 set for our compute node in slurm.conf, yet a job that
> started with mem=65536 was able to grow its memory usage over several hours
> of execution up to ~650GB. We expected that MaxMemPerNode would stop any job
> exceeding the limit of 532000. Did we miss something in slurm.conf? We were
> trying to avoid setting up a QOS for each group of users.
> Any help is welcome.
>
> Here is the node definition in the conf file
> ## Nodes list
> ## use native GPUs
> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1
> RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8
> Feature=gpu
>
>
> And here is the full slurm.conf file
> # node health check
> HealthCheckProgram=/usr/sbin/nhc
> HealthCheckInterval=300
>
> ## Timeouts
> SlurmctldTimeout=600
> SlurmdTimeout=600
>
> GresTypes=gpu
> AccountingStorageTRES=gres/gpu
> DebugFlags=CPU_Bind,gres
>
> ## We don't want a node to go back into the pool without sysadmin
> acknowledgement
> ReturnToService=0
>
> ## Basic scheduling
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
> SchedulerType=sched/backfill
>
> ## Accounting
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStoreJobComment=YES
> AccountingStorageHost=10.10.0.1
> AccountingStorageEnforce=limits
>
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/linux
>
> TaskPlugin=task/cgroup
> ProctrackType=proctrack/cgroup
>
> ## scripts
> Epilog=/etc/slurm/epilog
> Prolog=/etc/slurm/prolog
> PrologFlags=Alloc
>
> ## MPI
> MpiDefault=pmi2
>
> ## Nodes list
> ## use native GPUs
> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1
> RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8
> Feature=gpu
>
> ## Partitions list
> PartitionName=gpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=65556
> DefCpuPerGPU=8 DefMemPerGPU=65556 MaxMemPerNode=532000 MaxTime=3-12:00:00
> State=UP Nodes=nodeGPU01 Default=YES
> PartitionName=cpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=16384
> MaxMemPerNode=420000 MaxTime=3-12:00:00 State=UP Nodes=nodeGPU01
>
>
> --
> Cristóbal A. Navarro
>
> --
> Regards,
>
> Daniel Letai
> +972 (0)505 870 456
>
>

-- 
Cristóbal A. Navarro