Hello,
we found an issue with Slurm 24.05.1 and the MaxMemPerNode setting. Slurm is installed on a single workstation, so there is only one node.
The relevant sections in slurm.conf read:
,----
| EnforcePartLimits=ALL
| PartitionName=short Nodes=..... State=UP Default=YES MaxTime=2-00:00:00 MaxCPUsPerNode=76 MaxMemPerNode=231000 OverSubscribe=FORCE:1
`----
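For completeness, the effective values can be confirmed on the running system with scontrol (standard commands, nothing specific to our setup):

,----
| $ scontrol show partition short
| $ scontrol show config | grep EnforcePartLimits
`----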
Now, if I submit a job requesting 76 CPUs with 4000M each (304000M in total, well above the 231000M limit), Slurm does indeed respect the MaxMemPerNode setting and rejects the job at submission in the following cases ("-N 1" is not really necessary, as there is only one node):
,----
| $ sbatch -N 1 -n 1 -c 76 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
|
| $ sbatch -N 1 -n 76 -c 1 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
|
| $ sbatch -n 1 -c 76 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
`----
But with this submission Slurm is happy:
,----
| $ sbatch -n 76 -c 1 -p short --mem-per-cpu=4000M test.batch
| Submitted batch job 133982
`----
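The jobcomp record below already shows the allocated TRES; for a job still in the system, the same information should be visible with scontrol (a generic check, not specific to our setup):

,----
| $ scontrol show job 133982 | grep -i TRES
`----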
and the slurmjobcomp.log file does indeed tell me that the memory went above MaxMemPerNode:
,----
| JobId=133982 UserId=......(10487) GroupId=domain users(2000) Name=test JobState=CANCELLED Partition=short TimeLimit=45 StartTime=2024-09-04T09:11:17 EndTime=2024-09-04T09:11:24 NodeList=...... NodeCnt=1 ProcCnt=76 WorkDir=/tmp/. ReservationName= Tres=cpu=76,mem=304000M,node=1,billing=76 Account=ddgroup QOS=domino WcKey= Cluster=...... SubmitTime=2024-09-04T09:11:17 EligibleTime=2024-09-04T09:11:17 DerivedExitCode=0:0 ExitCode=0:0
`----
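As a further data point (untested on my side, so purely an assumption), it might be worth comparing against the same submission with the total memory requested via --mem instead of --mem-per-cpu, since that checks the per-node memory limit directly:

,----
| $ sbatch -n 76 -c 1 -p short --mem=304000M test.batch
`----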
What is the best way to report issues like this to the Slurm developers? I thought of filing it at https://support.schedmd.com/, but it is not clear to me whether that page is only meant for Slurm users with a support contract.
Cheers,