Hello,
Previously we were running 22.05.10 and could submit a "multinode" job using only the total number of cores to run, not the number of nodes. For example, in a cluster containing only 40-core nodes (no hyperthreading), Slurm would determine two nodes were needed with only: sbatch -p multinode -n 80 --wrap="...."
Now in 23.02.1 this is no longer the case - we get: sbatch: error: Batch job submission failed: Node count specification invalid
At least -N 2 must be used (-n 80 can be added): sbatch -p multinode -N 2 -n 80 --wrap="...."
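For reference, the equivalent jobscript form (illustrative only; the application name is just a placeholder):

#!/bin/bash
#SBATCH -p multinode
#SBATCH -N 2
#SBATCH -n 80
# placeholder application - substitute the real MPI program
srun ./my_mpi_app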
The partition config was, and is, as follows (MinNodes=2 to reject small jobs submitted to this partition - we want at least two nodes requested):

PartitionName=multinode State=UP Nodes=node[081-245] DefaultTime=168:00:00 MaxTime=168:00:00 PreemptMode=OFF PriorityTier=1 DefMemPerCPU=4096 MinNodes=2 QOS=multinode Oversubscribe=EXCLUSIVE Default=NO
All nodes are of the form:

NodeName=node245 NodeAddr=node245 State=UNKNOWN Procs=40 Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=187000
slurm.conf has:

EnforcePartLimits = ANY
SelectType = select/cons_tres
TaskPlugin = task/cgroup,task/affinity
A few fields from sacctmgr show qos multinode:

Name|Flags|MaxTRES
multinode|DenyOnLimit|node=5
The sbatch/srun man page states: -n, --ntasks .... If -N is not specified, the default behavior is to allocate enough nodes to satisfy the requested resources as expressed by per-job specification options, e.g. -n, -c and --gpus.
I've had a look through release notes back to 22.05.10 but can't see anything obvious (to me).
Has this behaviour changed? Or, more likely, what have I missed ;-) ?
Many thanks, George
--
George Leaver
Research Infrastructure, IT Services, University of Manchester
http://ri.itservices.manchester.ac.uk | @UoM_eResearch
Hi George,
George Leaver via slurm-users <slurm-users@lists.schedmd.com> writes:
> Hello,
> Previously we were running 22.05.10 and could submit a "multinode" job using only the total number of cores to run, not the number of nodes. For example, in a cluster containing only 40-core nodes (no hyperthreading), Slurm would determine two nodes were needed with only: sbatch -p multinode -n 80 --wrap="...."
> Now in 23.02.1 this is no longer the case - we get: sbatch: error: Batch job submission failed: Node count specification invalid
> At least -N 2 must be used (-n 80 can be added): sbatch -p multinode -N 2 -n 80 --wrap="...."
> The partition config was, and is, as follows (MinNodes=2 to reject small jobs submitted to this partition - we want at least two nodes requested): PartitionName=multinode State=UP Nodes=node[081-245] DefaultTime=168:00:00 MaxTime=168:00:00 PreemptMode=OFF PriorityTier=1 DefMemPerCPU=4096 MinNodes=2 QOS=multinode Oversubscribe=EXCLUSIVE Default=NO
But do you really want to force a job to use two nodes if it could in fact run on one?
What is the use-case for having separate 'uninode' and 'multinode' partitions? We have a university cluster with a very wide range of jobs and essentially a single partition. Allowing all job types to use one partition means that the different resource requirements tend to complement each other to some degree. Doesn't splitting up your jobs over two partitions mean that either one of the two partitions could be full, while the other has idle nodes?
Cheers,
Loris
Hi Loris,
> Doesn't splitting up your jobs over two partitions mean that either one of the two partitions could be full, while the other has idle nodes?
Yes, potentially, and we may move away from our current config at some point (it's a bit of a hangover from an SGE cluster.) Hasn't really been an issue at the moment.
Do you find fragmentation a problem? Or do you just let the bf scheduler handle that (assuming jobs have a realistic wallclock request?)
But for now, would be handy if users didn't need to adjust their jobscripts (or we didn't need to write a submit script.)
Regards, George
--
George Leaver
Research Infrastructure, IT Services, University of Manchester
http://ri.itservices.manchester.ac.uk | @UoM_eResearch
Hi George,
George Leaver via slurm-users <slurm-users@lists.schedmd.com> writes:
> Hi Loris,
>> Doesn't splitting up your jobs over two partitions mean that either one of the two partitions could be full, while the other has idle nodes?
> Yes, potentially, and we may move away from our current config at some point (it's a bit of a hangover from an SGE cluster.) Hasn't really been an issue at the moment.
> Do you find fragmentation a problem? Or do you just let the bf scheduler handle that (assuming jobs have a realistic wallclock request?)
Well, no: with essentially only one partition we don't have fragmentation related to that. We did use to have multiple partitions for different run-times, and we did have fragmentation then. However, I couldn't see any advantage in that setup, so we moved to one partition and various QOS to handle, say, test or debug jobs. That said, users do still sometimes add potentially arbitrary conditions to their job scripts, such as the number of nodes for MPI jobs. Whereas in principle it may be a good idea to reduce the MPI overhead by reducing the number of nodes, in practice any such advantage may well be cancelled out or exceeded by the extra time the job has to wait for those specific resources.
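For illustration, a test/debug QOS in that style might be set up along these lines (the QOS name and limits here are made up, not our actual values):

# create the QOS and give it short-run limits (example values only)
sacctmgr add qos debug
sacctmgr modify qos debug set MaxWall=01:00:00 MaxTRESPerUser=cpu=80 Priority=100
# the QOS also needs to be permitted, e.g. via the users' associations or the partition's AllowQos
# users then pick it at submission time:
sbatch --qos=debug -n 8 jobscript.sh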
Backfill works fairly well for us, although indeed not without a little badgering of users to get them to specify appropriate run-times.
> But for now, would be handy if users didn't need to adjust their jobscripts (or we didn't need to write a submit script.)
If you ditch one of the partitions, you could always use a job_submit plugin to replace the invalid partition specified by the job with the remaining partition.
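A minimal sketch of that idea in job_submit.lua (untested, and assuming JobSubmitPlugins=lua is set in slurm.conf; the partition names are just placeholders):

-- job_submit.lua: rewrite a retired partition name to the remaining partition.
-- Untested sketch; "multinode" and "standard" are placeholder names.
function slurm_job_submit(job_desc, part_list, submit_uid)
   if job_desc.partition == "multinode" then
      job_desc.partition = "standard"
      slurm.log_info("job_submit: rewrote partition multinode to standard for uid %d", submit_uid)
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end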
Cheers,
Loris