Hello again,
Angel de Vicente via slurm-users <slurm-users@lists.schedmd.com> writes:
[...] I don't understand is why the first three submissions below do get stopped by sbatch while the last one happily goes through?
,----
| $ sbatch -N 1 -n 1 -c 76 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
|
| $ sbatch -N 1 -n 76 -c 1 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
|
| $ sbatch -n 1 -c 76 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
`----
,----
| $ sbatch -n 76 -c 1 -p short --mem-per-cpu=4000M test.batch
| Submitted batch job 133982
`----
Ah, I think I do perhaps understand now...
In the first three cases Slurm knows that everything is going to run inside a single node (either because I explicitly set "-N 1", or because I'm submitting a single task that uses 76 CPUs: "-n 1 -c 76"). It therefore knows that the required memory (76 x 4000M = 304000M) exceeds the MaxMemPerNode configuration, and it blocks the submission.
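(For reference, that is the per-node figure those three cases imply; something along these lines should show the partition limit it gets compared against, though the exact output format may differ between Slurm versions:)

,----
| # 76 CPUs x 4000M per CPU = 304000M needed on a single node
| $ scontrol show partition short | grep -o 'MaxMemPerNode=[^ ]*'
`----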
In the last case my job consists of 76 single-CPU tasks, but I'm not explicitly asking for a number of nodes, so in theory the job could be spread across several nodes and MaxMemPerNode would not be exceeded; hence the submission is accepted (see the illustration below).
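(Just as an illustration, and assuming a partition that really had two such nodes with MaxMemPerNode above 152000M: spreading the same request explicitly would keep each node under the limit, since 38 tasks x 4000M = 152000M per node:)

,----
| $ sbatch -N 2 -n 76 -c 1 -p short --mem-per-cpu=4000M test.batch
`----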
[In my case I guess the confusion comes from the fact that there is only one node in this "cluster", so as far as memory limits are concerned I see the four cases as basically identical.]
In any case, even if the fourth job above gets past the submission phase, I would expect it to never actually run (on my system), because allocating it to a single node amounts to going back to submission #2.
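(If it does sit in the queue forever, the scheduler's reason should be visible with something like:)

,----
| $ scontrol show job 133982 | grep -i Reason
| $ squeue -j 133982 -o '%i %T %r'
`----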
Cheers,