<div dir="ltr"><div>Many thanks, Rodrigo and Daniel,</div><div>Indeed, I had misunderstood that part of Slurm. Thanks for clarifying; it makes a lot of sense now.<br></div><div>I went with the cgroup.conf approach, as you both suggested.<br></div><div>I will start running some synthetic tests to make sure a job gets killed once it exceeds its memory limit.</div><div>Many thanks again.<br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jan 13, 2023 at 3:49 AM Daniel Letai &lt;<a href="mailto:dani@letai.org.il">dani@letai.org.il</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="msg-4983056948497144032">

  <div style="direction:ltr">

    <p>Hello Cristóbal,</p>

    <p><br>

    </p>

    <p>I think you might have a slight misunderstanding of how Slurm

      works, which can cause this difference in expectation.</p>

    <p><br>

    </p>

    <p>MaxMemPerNode lets the scheduler plan job placement according
      to the resources requested. It does not enforce limits during
      job execution; it only governs placement, on the assumption that
      the job will not use more than the resources it requested.</p>

    <p><br>

    </p>

    <p>One option for limiting a job during execution is cgroups;
      another is <b>JobAcctGatherParams=OverMemoryKill</b>. I suspect
      cgroups is the better option for your use case, as the
      slurm.conf man page entry for OverMemoryKill notes:</p>

    <p><br>

    </p>

    <p>

      </p><blockquote type="cite">

        <p>Kill processes that are being detected to use more memory

          than requested by

          steps every time accounting information is gathered by the

          JobAcctGather plugin.

          This parameter should be used with caution because a job

          exceeding its memory

          allocation may affect other processes and/or machine health.

        </p>

        <p>

          <b>NOTE</b>: If available, it is recommended to limit memory

          by enabling

          task/cgroup as a TaskPlugin and making use of

          ConstrainRAMSpace=yes in the

          cgroup.conf instead of using this JobAcctGather mechanism for

          memory

          enforcement. Using JobAcctGather is polling based and there is

          a

          delay before a job is killed, which could lead to system Out

          of Memory events.

        </p>

        <p>

          <b>NOTE</b>: When using <b>OverMemoryKill</b>, if the

          combined memory used by

          all the processes in a step exceeds the memory limit, the

          entire step will be

          killed/cancelled by the JobAcctGather plugin.

          This differs from the behavior when using <b>ConstrainRAMSpace</b>,

          where

          processes in the step will be killed, but the step will be

          left active,

          possibly with other processes left running.</p>

      </blockquote>

      <br>

    <p></p>
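    <p>As a minimal sketch of the cgroup approach recommended in the man
      page note above, a cgroup.conf along the following lines could be
      used. It relies on the TaskPlugin=task/cgroup setting already
      present in the slurm.conf quoted below. The swap-related lines are
      an optional assumption, not something discussed in this thread,
      and should be checked against the cgroup.conf man page for your
      Slurm version.</p>

    <pre>
# cgroup.conf (illustrative sketch; adjust for your site)

# Confine each job/step to the memory it was allocated
ConstrainRAMSpace=yes

# Assumption: also constrain swap so a job cannot spill past its RAM limit
ConstrainSwapSpace=yes
AllowedSwapSpace=0
</pre>

    <p>With ConstrainRAMSpace=yes the limit should follow the job's own
      memory allocation rather than MaxMemPerNode, so a job submitted
      with mem=65536 would be held to roughly that amount.</p>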

    <p><br>

    </p>

    <div>On 12/01/2023 03:47:53, Cristóbal

      Navarro wrote:<br>

    </div>

    <blockquote type="cite">

      <div dir="ltr">

        <div>Hi Slurm community,</div>

        <div>Recently we found a small problem triggered by one of our
          jobs. We have <b>MaxMemPerNode</b>=<b>532000</b> set for our
          compute node in the slurm.conf file; however, we found that a
          job that started with mem=65536 was able, after hours of
          execution, to grow its memory usage up to ~650GB. We expected
          that <b>MaxMemPerNode</b> would stop any job exceeding the
          limit of 532000. Did we miss something in the slurm.conf file?
          We were trying to avoid setting up a QOS for each group of
          users.<br>

        </div>

        <div>Any help is welcome.<br>

        </div>

        <div><br>

        </div>

        <div>Here is the node definition in the conf file</div>

        <div><span style="font-family:monospace">## Nodes list<br>

            ## use native GPUs<br>

            NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16

            ThreadsPerCore=1 RealMemory=1024000 MemSpecLimit=65556

            State=UNKNOWN Gres=gpu:A100:8 Feature=gpu</span></div>

        <div><span style="font-family:monospace"><br>

          </span></div>

        <div><span style="font-family:monospace"><br>

          </span></div>

        <div><span style="font-family:monospace"><font face="arial,sans-serif">And here is the full slurm.conf

              file</font><br>

          </span></div>

        <div><span style="font-family:monospace"># node health check<br>

            HealthCheckProgram=/usr/sbin/nhc<br>

            HealthCheckInterval=300<br>

            <br>

            ## Timeouts<br>

            SlurmctldTimeout=600<br>

            SlurmdTimeout=600<br>

            <br>

            GresTypes=gpu<br>

            AccountingStorageTRES=gres/gpu<br>

            DebugFlags=CPU_Bind,gres<br>

            <br>

            ## We don't want a node to go back in pool without sys admin

            acknowledgement<br>

            ReturnToService=0<br>

            <br>

            ## Basic scheduling<br>

            SelectType=select/cons_tres<br>

            SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE<br>

            SchedulerType=sched/backfill<br>

            <br>

            ## Accounting <br>

            AccountingStorageType=accounting_storage/slurmdbd<br>

            AccountingStoreJobComment=YES<br>

            AccountingStorageHost=10.10.0.1<br>

            AccountingStorageEnforce=limits<br>

            <br>

            JobAcctGatherFrequency=30<br>

            JobAcctGatherType=jobacct_gather/linux<br>

            <br>

            TaskPlugin=task/cgroup<br>

            ProctrackType=proctrack/cgroup<br>

            <br>

            ## scripts<br>

            Epilog=/etc/slurm/epilog<br>

            Prolog=/etc/slurm/prolog<br>

            PrologFlags=Alloc<br>

            <br>

            ## MPI<br>

            MpiDefault=pmi2<br>

            <br>

            ## Nodes list<br>

            ## use native GPUs<br>

            NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16

            ThreadsPerCore=1 RealMemory=1024000 MemSpecLimit=65556

            State=UNKNOWN Gres=gpu:A100:8 Feature=gpu<br>

            <br>

            ## Partitions list<br>

            PartitionName=gpu OverSubscribe=No MaxCPUsPerNode=64

            DefMemPerNode=65556 DefCpuPerGPU=8 DefMemPerGPU=65556

            MaxMemPerNode=532000 MaxTime=3-12:00:00 State=UP

            Nodes=nodeGPU01 Default=YES <br>

            PartitionName=cpu OverSubscribe=No MaxCPUsPerNode=64

            DefMemPerNode=16384 MaxMemPerNode=420000 MaxTime=3-12:00:00

            State=UP Nodes=nodeGPU01</span> <br>

        </div>

        <div><br clear="all">

          <br>

          -- <br>

          <div dir="ltr">

            <div dir="ltr">

              <div>Cristóbal A. Navarro</div>

            </div>

          </div>

        </div>

      </div>

    </blockquote>

    <pre cols="72">-- 

Regards,

Daniel Letai

+972 (0)505 870 456</pre>

  </div>

</div></blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div>Cristóbal A. Navarro</div></div></div>