[slurm-users] Strange memory limit behavior with --mem-per-gpu

Paul Raines raines at nmr.mgh.harvard.edu
Wed Apr 6 19:30:14 UTC 2022


I have a user who submitted an interactive srun job using:

srun --mem-per-gpu 64 --gpus 1 --nodes 1 ....

From sacct for this job we see:

         ReqTRES : billing=4,cpu=1,gres/gpu=1,mem=10G,node=1
       AllocTRES : billing=4,cpu=1,gres/gpu=1,mem=64M,node=1

(where 10G I assume comes from the DefMemPerCPU=10240 set in slurm.conf)
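That arithmetic checks out: DefMemPerCPU is in MB, and with a single allocated CPU the default request works out to 10G (a quick sanity check, nothing Slurm-specific):

```shell
# DefMemPerCPU=10240 (MB) times 1 CPU, converted to GB
echo "$((10240 * 1 / 1024))G"   # 10G
```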

Now I think the user made a mistake here: 64M should be far too
little for this job, yet it is running fine.  They may have forgotten
the 'G' and meant 64G.
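If the missing unit suffix is the culprit, the intended invocation would presumably have been something like the following (srun treats a bare number to --mem-per-gpu as megabytes by default, so "64" means 64 MB; the trailing ... stands for whatever else the user passed):

```shell
# Explicit unit suffix: request 64 GB per GPU rather than 64 MB
srun --mem-per-gpu=64G --gpus=1 --nodes=1 ...
```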

The user submitted two jobs just like this, and both are running on the 
same node where I see:

  PID USER   PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
5496 nms88  20   0  521.1g 453.2g 175852 S 100.0  30.0   1110:37 python
5555 nms88  20   0  484.7g 413.3g 182456 S  93.8  27.4   1065:22 python

and if I cd to /sys/fs/cgroup/memory/slurm/uid_5143603/job_1120342
for one of the jobs I see:

# cat memory.limit_in_bytes
1621429846016
# cat memory.usage_in_bytes
744443580416

(the node itself has 1.5TB of RAM total)
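For reference, converting that cgroup limit to GiB with plain shell arithmetic shows it is roughly the whole node, not anything resembling the 64M allocation:

```shell
# memory.limit_in_bytes from the job's cgroup, converted to GiB
limit=1621429846016
echo "$((limit / 1024 / 1024 / 1024)) GiB"   # 1510 GiB, i.e. nearly all of the 1.5TB node
```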

So my question is: why did Slurm end up running the job this way?  Why
was the cgroup limit not 64MB, which would have made the job fail
with an OOM error fairly quickly?

For another user's job, submitted with

srun -N 1 --ntasks-per-node=1 --gpus=1 --mem=128G --cpus-per-task=3 ...

on the node, the memory cgroup shows the expected

# cat memory.limit_in_bytes
137438953472
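That value is exactly 128 GiB, matching the --mem=128G request (again just shell arithmetic as a sanity check):

```shell
# 128 GiB expressed in bytes matches memory.limit_in_bytes exactly
echo $((128 * 1024 * 1024 * 1024))   # 137438953472
```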

But I worry it could fail, since the other two jobs are consuming
essentially all of the node's memory.

---------------------------------------------------------------
Paul Raines                     http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street     Charlestown, MA 02129	    USA






