[slurm-users] Strange memory limit behavior with --mem-per-gpu
Paul Raines
raines at nmr.mgh.harvard.edu
Thu Apr 7 13:56:15 UTC 2022
Basically, it appears that using --mem-per-gpu instead of --mem leaves the
job's memory cgroup limit at the full RAM of the node, i.e. effectively
unlimited memory for the job:
$ srun --account=sysadm -p rtx8000 -N 1 --time=1-10:00:00 \
    --ntasks-per-node=1 --cpus-per-task=1 --gpus=1 --mem-per-gpu=8G \
    --mail-type=FAIL --pty /bin/bash
rtx-07[0]:~$ find /sys/fs/cgroup/memory/ -name job_$SLURM_JOBID
/sys/fs/cgroup/memory/slurm/uid_5829/job_1134067
rtx-07[0]:~$ cat /sys/fs/cgroup/memory/slurm/uid_5829/job_1134067/memory.limit_in_bytes
1621419360256
That is a limit of roughly 1.5TB, which is all the memory on rtx-07, not
the 8G I effectively asked for (1 GPU at 8G per GPU).
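(For reference, that number works out to about 1510 GiB, e.g. with bash
integer arithmetic:

  $ echo $((1621419360256 / 1024 / 1024 / 1024))
  1510
)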
Using --mem, on the other hand, works as expected:
$ srun --account=sysadm -p rtx8000 -N 1 --time=1-10:00:00 \
    --ntasks-per-node=1 --cpus-per-task=1 --gpus=1 --mem=8G \
    --mail-type=FAIL --pty /bin/bash
rtx-07[0]:~$ find /sys/fs/cgroup/memory/ -name job_$SLURM_JOBID
/sys/fs/cgroup/memory/slurm/uid_5829/job_1134068
rtx-07[0]:~$ cat /sys/fs/cgroup/memory/slurm/uid_5829/job_1134068/memory.limit_in_bytes
8589934592
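In case anyone wants to reproduce this, here is the quick check I run from
inside the job shell (a rough sketch, assuming cgroup v1 and the
slurm/uid_*/job_* hierarchy shown above):

  # locate the job's memory cgroup and compare its limit to the node's RAM
  CG=$(find /sys/fs/cgroup/memory/ -name "job_$SLURM_JOBID" | head -1)
  echo "cgroup limit: $(( $(cat $CG/memory.limit_in_bytes) / 1024 / 1024 )) MiB"
  echo "node total:   $(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 )) MiB"

If the first number matches the second instead of the --mem or
--mem-per-gpu request, the job is effectively uncapped.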
On Wed, 6 Apr 2022 3:30pm, Paul Raines wrote:
>
> I have a user who submitted an interactive srun job using:
>
> srun --mem-per-gpu 64 --gpus 1 --nodes 1 ....
>
> From sacct for this job we see:
>
> ReqTRES : billing=4,cpu=1,gres/gpu=1,mem=10G,node=1
> AllocTRES : billing=4,cpu=1,gres/gpu=1,mem=64M,node=1
>
> (where 10G I assume comes from the DefMemPerCPU=10240 set in slurm.conf)
>
> Now I think the user made a mistake here: 64M should be far too little
> memory for this job, yet it is running fine. They probably forgot the
> 'G' and meant 64G.
>
> The user submitted two jobs just like this, and both are running on the same
> node where I see:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5496 nms88 20 0 521.1g 453.2g 175852 S 100.0 30.0 1110:37 python
> 5555 nms88 20 0 484.7g 413.3g 182456 S 93.8 27.4 1065:22 python
>
> and if I cd to /sys/fs/cgroup/memory/slurm/uid_5143603/job_1120342
> for one of the jobs I see:
>
> # cat memory.limit_in_bytes
> 1621429846016
> # cat memory.usage_in_bytes
> 744443580416
>
> (the node itself has 1.5TB of RAM total)
>
> So my question is: why did SLURM end up running the job this way? Why
> was the cgroup limit not 64MB, which would have made the job fail with
> OOM pretty quickly?
>
> On someone else's job submitted with
>
> srun -N 1 --ntasks-per-node=1 --gpus=1 --mem=128G --cpus-per-task=3 ...
>
> on the node in the memory cgroup I see the expected
>
> # cat memory.limit_in_bytes
> 137438953472
>
> But I worry it could fail since those other two jobs are essentially
> consuming all the memory.
>
> ---------------------------------------------------------------
> Paul Raines http://help.nmr.mgh.harvard.edu
> MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
> 149 (2301) 13th Street Charlestown, MA 02129 USA