Angel,
Unless you are using cgroups and memory constraints, no limit is actually imposed. The numbers are used by Slurm to track what is available, not to police what you may or may not use. So you could tell Slurm the node only has 1GB and it would refuse any request for more than that, but if you request only 1GB, without enforcement configured there is nothing stopping the job from using more.
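For completeness, enforcing the limit generally means turning on the cgroup plugins; a minimal sketch (the exact settings depend on your environment and Slurm version):
,----
| # slurm.conf -- track and schedule memory, delegate containment to cgroups
| ProctrackType=proctrack/cgroup
| TaskPlugin=task/cgroup
| SelectType=select/cons_tres
| SelectTypeParameters=CR_Core_Memory
|
| # cgroup.conf -- actually constrain what jobs can use
| ConstrainCores=yes
| ConstrainRAMSpace=yes
`----
With ConstrainRAMSpace=yes, a step that tries to exceed its allocation is confined by the kernel cgroup rather than merely tracked.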
So your request did not exceed what Slurm sees as available for each task (1 CPU using 4GB), and it is happy to let your script run. I suspect that if you look at the actual usage, you will see that one CPU spiked high while the others did nothing.
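If you have accounting enabled, sacct can show that; something along these lines (the job id is just a placeholder):
,----
| # peak resident memory and CPU time actually used, per job step
| $ sacct -j <jobid> --format=JobID,ReqMem,MaxRSS,TotalCPU,Elapsed
`----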
Brian Andrus
On 9/4/2024 1:37 AM, Angel de Vicente via slurm-users wrote:
Hello,
We found an issue with Slurm 24.05.1 and the MaxMemPerNode setting. Slurm is installed on a single workstation, so the number of nodes is just 1.
The relevant sections in slurm.conf read:
,----
| EnforcePartLimits=ALL
| PartitionName=short Nodes=..... State=UP Default=YES MaxTime=2-00:00:00 MaxCPUsPerNode=76 MaxMemPerNode=231000 OverSubscribe=FORCE:1
`----
Now, if I submit a job requesting 76 CPUs, each needing 4000M (for a total of 304000M), Slurm does indeed respect the MaxMemPerNode setting and rejects the job in the following cases ("-N 1" is not really necessary, as there is only one node):
,----
| $ sbatch -N 1 -n 1 -c 76 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
|
| $ sbatch -N 1 -n 76 -c 1 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
|
| $ sbatch -n 1 -c 76 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
`----
But with this submission Slurm is happy:
,----
| $ sbatch -n 76 -c 1 -p short --mem-per-cpu=4000M test.batch
| Submitted batch job 133982
`----
and the slurmjobcomp.log file does indeed tell me that the memory went above MaxMemPerNode:
,----
| JobId=133982 UserId=......(10487) GroupId=domain users(2000) Name=test JobState=CANCELLED Partition=short TimeLimit=45 StartTime=2024-09-04T09:11:17 EndTime=2024-09-04T09:11:24 NodeList=...... NodeCnt=1 ProcCnt=76 WorkDir=/tmp/. ReservationName= Tres=cpu=76,mem=304000M,node=1,billing=76 Account=ddgroup QOS=domino WcKey= Cluster=...... SubmitTime=2024-09-04T09:11:17 EligibleTime=2024-09-04T09:11:17 DerivedExitCode=0:0 ExitCode=0:0
`----
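For reference, the mismatch between what the job was granted and the partition limit can also be checked after the fact (exact output fields may vary between Slurm versions):
,----
| $ scontrol show partition short | grep -o 'MaxMemPerNode=[^ ]*'
| $ sacct -j 133982 --format=JobID,ReqTRES%40,AllocTRES%40
`----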
What is the best way to report issues like this to the Slurm developers? I thought of filing it at https://support.schedmd.com/, but it is not clear to me whether that site is meant only for Slurm users with a support contract.
Cheers,