Dear all
We have just installed a small SLURM cluster composed of 12 nodes:
- 6 CPU-only nodes: Sockets=2, CoresPerSocket=96, ThreadsPerCore=2, 1.5 TB of RAM
- 6 nodes also with GPUs: same configuration as the CPU-only nodes, plus 4 H100s per node
We started with a setup with 2 partitions:
- a 'onlycpus' partition which sees all the CPU-only nodes
- a 'gpus' partition which sees the nodes with GPUs
and asked users to use the 'gpus' partition only for jobs that need GPUs (for the time being we are not technically enforcing that).
The problem is that a job requiring a GPU usually needs only a few cores and a few GB of RAM, which means a lot of CPU cores would be wasted. On the other hand, putting all nodes in a single partition would mean risking that a job requiring a GPU can't start because all CPU cores and/or all the memory are already in use by CPU-only jobs.
I went through the mailing list archive and I think that "splitting" a GPU node into two logical nodes (one to be used in the 'gpus' partition and one to be used in the 'onlycpus' partition) as discussed in [*] would help.
Since that proposed solution is considered by its author a "bit of a kludge", and since I read that splitting a node into multiple logical nodes is in general a bad idea, I'd like to know if you could suggest other/better options.
I also found this [**] thread, but I don't like that approach too much (i.e. relying on MaxCPUsPerNode), because it would mean having 3 partitions (if I have got it right): two partitions for CPU-only jobs and one partition for GPU jobs.
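If I understood [**] correctly, that approach would look roughly like this (partition names, node names and numbers here are just placeholders of mine):

PartitionName=onlycpus       Nodes=cpu[01-06] State=UP
PartitionName=cpusongpunodes Nodes=gpu[01-06] MaxCPUsPerNode=88 State=UP
PartitionName=gpus           Nodes=gpu[01-06] State=UP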
Many thanks, Massimo
[*] https://groups.google.com/g/slurm-users/c/IUd7jLKME3M
[**] https://groups.google.com/g/slurm-users/c/o7AiYAQ1YJ0
Ciao Massimo,
How about creating another queue, cpus_in_the_gpu_nodes (or something less silly), which targets the GPU nodes but does not allow the allocation of the GPUs via GRES, and allocates 96-8 (or whatever other number you deem appropriate) of the CPUs (and similarly for the memory)? Actually, it could even be the same "onlycpus" queue, just on different nodes.
In fact, in Slurm you declare the cores (and sockets) in a node-based, not queue-based, fashion. But you can set up an alias for those nodes with a second name and use that second name in the way described above. I am not aware (and have not checked) of Slurm being able to understand such a situation on its own, so you will have to manually avoid "double booking". One way of doing that could be to configure the nodes under their first name in a way that makes Slurm think they have fewer resources. So, for example, in slurm.conf:
NodeName=gpu[01-06] CoresPerSocket=4 RealMemory=whatever1 Sockets=2 ThreadsPerCore=1 Weight=10000 State=UNKNOWN Gres=gpu:h100:4
NodeName=cpusingpu[01-06] CoresPerSocket=44 RealMemory=whatever2 Sockets=2 ThreadsPerCore=1 Weight=10000 State=UNKNOWN
where gpuNN and cpusingpuNN are physically the same node, and whatever1 + whatever2 is the actual maximum amount of memory you want Slurm to allocate. You will also want to make sure the Weights are such that the non-GPU nodes get used first.
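To spell out the partition side of that idea, a minimal sketch (assuming your CPU-only nodes are named cpu[01-06]; I have not tested this) could be:

# 'onlycpus' sees the real CPU nodes plus the CPU-only alias of the GPU nodes;
# 'gpus' only sees the alias that carries the GRES
PartitionName=onlycpus Nodes=cpu[01-06],cpusingpu[01-06] Default=YES State=UP
PartitionName=gpus     Nodes=gpu[01-06] State=UP

Giving the cpu[01-06] entries a Weight lower than 10000 would then make Slurm fill the real CPU-only nodes before touching the CPU share of the GPU nodes.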
Disclaimer: I'm thinking out loud, I have not tested this in practice, there may be something I overlooked.
On Mon, Mar 31, 2025 at 5:12 AM Massimo Sgaravatto via slurm-users <slurm-users@lists.schedmd.com> wrote:
Hi Davide,
Thanks for your feedback.
If gpu01 and cpusingpu01 are physically the same node, doesn't this mean that I have to start two slurmd daemons on that node (one with "slurmd -N gpu01" and one with "slurmd -N cpusingpu01")?
Thanks, Massimo
On Mon, Mar 31, 2025 at 3:22 PM Davide DelVento <davide.quantum@gmail.com> wrote:
Yes, I think so, but that should be no problem. I think it requires that your Slurm was built with the --enable-multiple-slurmd configure option, so you might need to rebuild Slurm if you didn't use that option in the first place.
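Roughly (untested, and with a placeholder hostname and ports), the multiple-slurmd setup would look like this:

# slurm.conf: both logical nodes map to the same physical host; with
# --enable-multiple-slurmd each slurmd instance needs its own Port
NodeName=gpu01       NodeHostname=node01 Port=6819 Sockets=2 CoresPerSocket=4  ThreadsPerCore=1 RealMemory=whatever1 Gres=gpu:h100:4
NodeName=cpusingpu01 NodeHostname=node01 Port=6820 Sockets=2 CoresPerSocket=44 ThreadsPerCore=1 RealMemory=whatever2

# on the node itself, start one daemon per logical node name
slurmd -N gpu01
slurmd -N cpusingpu01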
On Mon, Mar 31, 2025 at 7:32 AM Massimo Sgaravatto <massimo.sgaravatto@gmail.com> wrote:
What I have done is set up partition QOSes for nodes with 4 GPUs and 64 cores:
sacctmgr add qos lcncpu-part
sacctmgr modify qos lcncpu-part set priority=20 \
    flags=DenyOnLimit MaxTRESPerNode=cpu=32,gres/gpu=0

sacctmgr add qos lcngpu-part
sacctmgr modify qos lcngpu-part set priority=20 \
    flags=DenyOnLimit MaxTRESPerNode=cpu=32,gres/gpu=4
Then I defined lcncpu and lcngpu partitions that use QOS=lcncpu-part and QOS=lcngpu-part respectively over those nodes.
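In slurm.conf terms that is roughly the following (the node list here is just a placeholder for our nodes):

PartitionName=lcncpu Nodes=lcn[01-04] QOS=lcncpu-part State=UP
PartitionName=lcngpu Nodes=lcn[01-04] QOS=lcngpu-part State=UP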
I think one could oversubscribe the cores, such as cpu=48 on each, if you want to allow it.
On Mon, 31 Mar 2025 9:21am, Davide DelVento via slurm-users wrote:
To me at least, the simplest solution would be to create 3 partitions: the first for the CPU-only nodes, the second for the GPU nodes, and the third a lower-priority requeue partition. This is how we do it here. This way the requeue partition can be used to grab the CPUs on the GPU nodes without preventing jobs in the GPU partition from grabbing those CPUs when they need them to launch GPU jobs.
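Sketched in slurm.conf terms (partition and node names are placeholders, and this assumes PreemptType=preempt/partition_prio is set globally so the requeue partition can be preempted):

PartitionName=cpu     Nodes=cpu[01-06] PriorityTier=10 State=UP
PartitionName=gpu     Nodes=gpu[01-06] PriorityTier=10 State=UP
# lower PriorityTier plus PreemptMode=REQUEUE: jobs here yield (and get
# requeued) when the cpu or gpu partitions need the resources
PartitionName=requeue Nodes=cpu[01-06],gpu[01-06] PriorityTier=1 PreemptMode=REQUEUE State=UP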
In our case we add an additional layer: a job_submit.lua script that prevents users from submitting CPU-only jobs to the GPU partition. Thus people wanting to use CPUs can submit to both the cpu and the requeue partitions (as Slurm permits multi-partition submissions), the GPU partition won't be blocked by anything, and you can farm out the spare cycles on the GPU nodes. This works well for our needs.
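For illustration, a minimal job_submit.lua along those lines could look like the following (a sketch rather than our exact script; the job_desc field that carries the GPU request differs between Slurm versions):

-- job_submit.lua (requires JobSubmitPlugins=lua in slurm.conf)
function slurm_job_submit(job_desc, part_list, submit_uid)
   -- the GPU request shows up in different fields depending on the version
   local gres = job_desc.tres_per_node or job_desc.gres or ""
   if job_desc.partition == "gpu" and not string.find(gres, "gpu") then
      slurm.log_user("CPU-only jobs must use the cpu or requeue partitions, not gpu")
      return slurm.ERROR
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end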
-Paul Edmon-
On 3/31/2025 9:39 AM, Paul Raines via slurm-users wrote: