[slurm-users] MaxMemPerCPU not enforced?
Angel de Vicente
angel.de.vicente at iac.es
Mon Jul 24 21:53:05 UTC 2023
Hello,
Matthew Brown <brownm12 at vt.edu> writes:
> Minimum memory required per allocated CPU. ... Note that if the job's
> --mem-per-cpu value exceeds the configured MaxMemPerCPU, then the
> user's limit will be treated as a memory limit per task
Ah, thanks, I should've read the documentation more carefully.
From my limited tests today, everything now seems OK in the interactive
queue, but not in the 'batch' queue. For example, I just submitted
three jobs with different numbers of CPUs per job (4, 8 and 16 processes
respectively). MaxMemPerCPU is set to 2GB, and these jobs run the
'stress' command, consuming 3GB per process.
,----
| [user at xxx test]$ squeue
|  JOBID PARTITION NAME USER ST TIME TIME_LIMIT CPUS QOS    ACCOUNT NODELIST(REASON)
| 127564 batch     test user R  9:25 15:00        16 normal ddgroup xxx
| 127562 batch     test user R  9:25 15:00         4 normal ddgroup xxx
| 127563 batch     test user R  9:25 15:00         8 normal ddgroup xxx
`----
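To make the mismatch explicit, here is a rough sketch of the arithmetic. It assumes that each job's memory limit is simply its CPU count times MaxMemPerCPU; I have not confirmed that this is the limit Slurm actually computed for these jobs, and the function name is only for illustration.

```python
# Assumption: limit = CPUs * MaxMemPerCPU, one 'stress' worker per CPU.
MAX_MEM_PER_CPU_GB = 2    # MaxMemPerCPU as configured (from this message)
VM_BYTES_PER_PROC_GB = 3  # stress --vm-bytes 3G


def memory_budget(cpus):
    """Return (limit_gb, demand_gb) for a job with `cpus` CPUs."""
    limit_gb = cpus * MAX_MEM_PER_CPU_GB
    demand_gb = cpus * VM_BYTES_PER_PROC_GB
    return limit_gb, demand_gb


# Job IDs and CPU counts taken from the squeue listing.
for job_id, cpus in [(127562, 4), (127563, 8), (127564, 16)]:
    limit_gb, demand_gb = memory_budget(cpus)
    print(f"job {job_id}: {cpus} CPUs, limit {limit_gb} GB, "
          f"demand {demand_gb} GB, over by {demand_gb - limit_gb} GB")
```

Under that assumption every job asks for 1.5 times its limit, so all three should be affected.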
It looks like Slurm is trying to kill the jobs, but somehow not all the
processes die (as you can see below, 2 of the 4 processes in job
127562 are still there after 9 minutes, as are 3 of the 8 processes in
job 127563 and 6 of the 16 processes in job 127564):
,----
| [user at xxx test]$ ps -fea | grep stress
| user 1853317 1853314 0 22:35 ? 00:00:00 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
| user 1853319 1853317 66 22:35 ? 00:06:17 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
| user 1853320 1853317 65 22:35 ? 00:06:11 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
| user 1853321 1853317 65 22:35 ? 00:06:11 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
| user 1853328 1853317 65 22:35 ? 00:06:12 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
| user 1853329 1853317 65 22:35 ? 00:06:12 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
| user 1853338 1853337 0 22:35 ? 00:00:00 stress -m 8 -t 600 --vm-keep --vm-bytes 3G
| user 1853340 1853338 68 22:35 ? 00:06:32 stress -m 8 -t 600 --vm-keep --vm-bytes 3G
| user 1853341 1853338 69 22:35 ? 00:06:34 stress -m 8 -t 600 --vm-keep --vm-bytes 3G
| user 1853347 1853316 0 22:35 ? 00:00:00 stress -m 4 -t 600 --vm-keep --vm-bytes 3G
| user 1853350 1853347 68 22:35 ? 00:06:29 stress -m 4 -t 600 --vm-keep --vm-bytes 3G
| user 1854560 1511070 0 22:45 pts/2 00:00:00 grep stress
`----
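The per-job counts can be checked mechanically from the listing. This is a minimal sketch that groups the `ps` lines by the `-m N` argument of `stress`, which here happens to identify the job; the helper name is hypothetical and the text is pasted from the output above.

```python
import re
from collections import Counter

# The 'ps -fea | grep stress' output shown above, copied as-is.
PS_OUTPUT = """\
user 1853317 1853314 0 22:35 ? 00:00:00 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
user 1853319 1853317 66 22:35 ? 00:06:17 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
user 1853320 1853317 65 22:35 ? 00:06:11 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
user 1853321 1853317 65 22:35 ? 00:06:11 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
user 1853328 1853317 65 22:35 ? 00:06:12 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
user 1853329 1853317 65 22:35 ? 00:06:12 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
user 1853338 1853337 0 22:35 ? 00:00:00 stress -m 8 -t 600 --vm-keep --vm-bytes 3G
user 1853340 1853338 68 22:35 ? 00:06:32 stress -m 8 -t 600 --vm-keep --vm-bytes 3G
user 1853341 1853338 69 22:35 ? 00:06:34 stress -m 8 -t 600 --vm-keep --vm-bytes 3G
user 1853347 1853316 0 22:35 ? 00:00:00 stress -m 4 -t 600 --vm-keep --vm-bytes 3G
user 1853350 1853347 68 22:35 ? 00:06:29 stress -m 4 -t 600 --vm-keep --vm-bytes 3G
user 1854560 1511070 0 22:45 pts/2 00:00:00 grep stress
"""


def count_by_workers(ps_text):
    """Count 'stress' processes per '-m N' value; other lines are skipped."""
    counts = Counter()
    for line in ps_text.splitlines():
        match = re.search(r"\bstress -m (\d+)\b", line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


for workers, left in sorted(count_by_workers(PS_OUTPUT).items()):
    print(f"stress -m {workers}: {left} processes left")
```

Note that each count includes the parent `stress` process of the job (the entries with 00:00:00 CPU time, whose PID is the PPID of the others), so the number of surviving workers is one fewer in each case.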
And these processes are truly using 3GB:
,----
| [user at xxx test]$ ps -v 1853319
|     PID TTY STAT TIME MAJFL TRS     DRS     RSS %MEM COMMAND
| 1853319 ?   R    6:25  8642  11 3149428 3146040  1.1 stress -m 16 -t 600 --vm-keep --vm-bytes 3G
`----
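As a quick sanity check on the units, assuming `ps -v` reports RSS in KiB (as procps documents it), the figure above converts as follows; the helper name is just for illustration.

```python
KIB_PER_GIB = 1024 ** 2  # 1 GiB = 1024 * 1024 KiB


def kib_to_gib(kib):
    """Convert a size in KiB to GiB."""
    return kib / KIB_PER_GIB


rss_kib = 3146040  # RSS column for PID 1853319 in the ps output above
print(f"RSS: {kib_to_gib(rss_kib):.2f} GiB")
```

That is well above the 2GB MaxMemPerCPU for a single process.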
Any idea how to solve/debug this?
Many thanks,
--
Ángel de Vicente
Research Software Engineer (Supercomputing and BigData)
Tel.: +34 922-605-747
Web.: http://research.iac.es/proyecto/polmag/
GPG: 0x8BDC390B69033F52