Hello, everyone. With Slurm, how can I allocate a whole node to a single multi-threaded process?
https://stackoverflow.com/questions/78818547/with-slurm-how-to-allocate-a-wh...
In part, it depends on how it's been configured, but have you tried --exclusive?
On Thu, Aug 1, 2024 at 7:39 AM Henrique Almeida via slurm-users < slurm-users@lists.schedmd.com> wrote:
Hello, everyone, with slurm, how to allocate a whole node for a single multi-threaded process?
https://stackoverflow.com/questions/78818547/with-slurm-how-to-allocate-a-wh...
Hello, I'm testing it right now and it's working pretty well in a normal situation, but that's not exactly what I want. The --exclusive documentation says that the job allocation cannot share nodes with other running jobs, but I want to allow it to do so if that's unavoidable. Are there other ways to configure it?
The current parameters I'm testing:
sbatch -N 1 --exclusive --ntasks-per-node=1 --mem=0 pz-train.batch
On Thu, Aug 1, 2024 at 12:29 PM Davide DelVento davide.quantum@gmail.com wrote:
In part, it depends on how it's been configured, but have you tried --exclusive?
On the one hand, you say you want "to *allocate a whole node* for a single multi-threaded process," but on the other you say you want to allow it to "*share nodes* with other running jobs." Those seem like mutually exclusive requirements.
Jason
On Thu, Aug 1, 2024 at 1:32 PM Henrique Almeida via slurm-users < slurm-users@lists.schedmd.com> wrote:
Hello, I'm testing it right now and it's working pretty well in a normal situation, but that's not exactly what I want. --exclusive documentation says that the job allocation cannot share nodes with other running jobs, but I want to allow it to do so, if that's unavoidable. Are there other ways to configure it ?
The current parameters I'm testing:
sbatch -N 1 --exclusive --ntasks-per-node=1 --mem=0 pz-train.batch
Hello, maybe I should rephrase the question as: how to fill a whole node?
On Thu, Aug 1, 2024 at 3:08 PM Jason Simms jsimms1@swarthmore.edu wrote:
On the one hand, you say you want "to allocate a whole node for a single multi-threaded process," but on the other you say you want to allow it to "share nodes with other running jobs." Those seem like mutually exclusive requirements.
Either allocate all of the node's cores or all of the node's memory? Both will effectively allocate the node exclusively for you.
So you'll need to know what a node looks like. For a homogeneous cluster, this is straightforward. For a heterogeneous cluster, you may also need to specify a nodelist for, say, those 28-core nodes and then those 64-core nodes.
But going back to the original suggestion: --exclusive is the answer here. You DO know how many cores you need, right? (A scaling study should give you that.) And you DO know the memory footprint from past jobs with similar inputs, I hope.
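For illustration, something along these lines (the 56 is just a placeholder for whatever your nodes actually have, check with sinfo, and pz-train.batch is the script from your earlier command):

  # take every core on the node (assuming a 56-core node here):
  sbatch -N 1 -n 1 --cpus-per-task=56 pz-train.batch
  # or take all of the node's memory, which on clusters that track memory
  # as a consumable resource also keeps other jobs off the node:
  sbatch -N 1 -n 1 --mem=0 pz-train.batch
  # or simply request the whole node outright:
  sbatch -N 1 -n 1 --exclusive pz-train.batch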
Bill
On 8/1/24 3:17 PM, Henrique Almeida via slurm-users wrote:
Hello, maybe rephrase the question to fill a whole node ?
Bill, would this allow allocating all the remaining harts (hardware threads) when the node is initially half full? How are the parameters set up for that? The cluster has 14 machines with 56 harts and 128 GB RAM, and 12 machines with 104 harts and 256 GB RAM.
Some of the algorithms used have hot loops that scale close to or beyond the number of harts, so it will always be beneficial to use all available harts in an opportunistic, best-effort way. The algorithms are for training photometric galaxy redshift estimators (galaxy distance calculators). Training will be repeated fairly often because of the large number of available physical parameters. The amount of memory required right now seems to be below 10 GB, but I can't say the same for all the algorithms that will be used (at least 6 different ones), nor for the different parameters expected to be required.
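For context, I've been looking at the node shapes with sinfo, and one variant I'm testing asks for a fixed number of harts plus only the memory the job actually needs (the -c and --mem values below are just examples taken from the numbers above):

  # node name, CPUs and memory (MB) per node
  sinfo -N -o '%N %c %m'
  # request all harts of a small node but only ~10 GB of memory
  sbatch -N 1 --ntasks-per-node=1 -c 56 --mem=10G pz-train.batch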
On Thu, Aug 1, 2024 at 4:27 PM Bill via slurm-users slurm-users@lists.schedmd.com wrote:
Either allocate the whole node's cores or the whole node's memory? Both will allocate the node exclusively for you.
So you'll need to know what a node looks like. For a homogeneous cluster, this is straightforward. For a heterogeneous cluster, you may also need to specify a nodelist for say those 28 core nodes and then those 64 core nodes.
But going back to the original answer, --exclusive, is the answer here. You DO know how many cores you need right? (Scaling study should give you that). And you DO know the memory footprint by past jobs with similar inputs I hope.
Hello, sharing would be unavoidable when all nodes are either fully or partially allocated. There will be cases of very simple background tasks occupying, for example, 1 hart in a machine.
On Thu, Aug 1, 2024 at 3:08 PM Laura Hild lsh@jlab.org wrote:
Hi Henrique. Can you give an example of sharing being unavoidable?
So you're wanting the job, instead of waiting for the task to finish and then running on the whole node, to run immediately on n-1 CPUs? If there were only one CPU available in the entire cluster, would you want the job to start running immediately on one CPU instead of waiting for more?
Laura, yes, as long as there's around 10 GB of RAM available, and ideally at least 5 harts too, but I expect 50 most of the time, not 5.
On Thu, Aug 1, 2024 at 4:28 PM Laura Hild lsh@jlab.org wrote:
So you're wanting that, instead of waiting for the task to finish and then running on the whole node, that the job should run immediately on n-1 CPUs? If there were only one CPU available in the entire cluster, would you want the job to start running immediately on one CPU instead of waiting for more?
I think all of the replies point to --exclusive being your best solution (only solution?).
You need to know exactly the maximum number of cores a particular application (or applications) will use. Then you can allow other applications to use the unused cores. Otherwise, at some point while the applications are running, they are going to use the same core and you could have problems. I don't know of any way you can allow one application to use more cores than it was allocated without the possibility of multiple applications using the same cores.
Fundamentally you should not have one application using a variable number of cores with a second application also using the same cores. (IMHO)
As everyone has said, your best bet is to use --exclusive and allow an application to have access to all of the cores even if they don't use all of them all the time.
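For example, a minimal batch script along these lines (just a sketch: ./pz-train is a placeholder for however you normally launch the training, and OMP_NUM_THREADS only matters if the code is OpenMP-based):

  #!/bin/bash
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=1
  #SBATCH --exclusive
  #SBATCH --mem=0
  # Let the application use every core Slurm handed us on this node.
  export OMP_NUM_THREADS="${SLURM_CPUS_ON_NODE}"
  ./pz-train   # placeholder for the real training command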
Good luck.
Jeff
P.S. Someone mentioned watching memory usage on the node. That too is important if you do not use --exclusive. Otherwise Mr. OOM will come to visit (the out-of-memory killer, which starts killing processes). In my experience, the OOM killer kills HPC processes first because they use most of the memory and most of the CPU time.
On Thu, Aug 1, 2024 at 4:06 PM Henrique Almeida via slurm-users < slurm-users@lists.schedmd.com> wrote:
Laura, yes, as long as there's around 10 GB of RAM available, and ideally at least 5 harts too, but I expect 50 most of the time, not 5.
You can't have both exclusive access to a node and sharing; that makes no sense. You see this on AWS as well: you can choose either to share a physical machine or not. There is no "don't share if possible, and share otherwise".
Unless you configure SLURM to overcommit CPUs, by definition if you request all the CPUs in the machine, you will get exclusive access. But if any of the CPUs are allocated, then your job won’t start.
One way you can improve this is to configure SLURM to fill each node up with jobs first, before starting to schedule jobs onto a new node. This isn't good for traditional HPC MPI jobs, but if your jobs are all multithreaded or single-threaded, it will work quite well, and it will keep nodes free so that jobs which do actually require exclusive access are more likely to be scheduled. This probably means (but others please correct me) that you DON'T want CR_LLN, and you probably do want CR_Pack_Nodes.
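In slurm.conf terms that would be something like the following (a sketch, assuming cons_tres is in use; the right combination depends on the rest of your configuration):

  SelectType=select/cons_tres
  # track cores and memory as consumable resources, and pack jobs onto
  # already-busy nodes before opening up idle ones
  SelectTypeParameters=CR_Core_Memory,CR_Pack_Nodes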
Tim
-- Tim Cutts Scientific Computing Platform Lead AstraZeneca
On Thursday, 1 August 2024 at 8:21 PM, Henrique Almeida via slurm-users slurm-users@lists.schedmd.com wrote:
Hello, sharing would be unavoidable when all nodes are either fully or partially allocated. There will be cases of very simple background tasks occupying, for example, 1 hart in a machine.
My read is that Henrique wants to specify a job to require a variable number of CPUs on one node, so that when the job is at the front of the queue, it will run opportunistically on however many happen to be available on a single node as long as there are at least five.
I don't personally know of a way to specify such a job, and wouldn't be surprised if there isn't one, since as other posters have suggested, usually there's a core-count sweet spot that should be used, achieving a performance goal while making efficient use of resources. A cluster administrator may in fact not want you using extra cores, even if there's a bit more speed-up to be had, when those cores could be used more efficiently by another job. I'm also not sure how one would set a judicious TimeLimit on a job that would have such a variable wall-time.
So there is the question of whether it is possible, and whether it is advisable.
I am pretty sure this is impossible with vanilla Slurm.
What might be possible (maybe) is submitting 5-core jobs and using some pre/post scripts which, immediately before the job starts, change the requested number of cores to "however many are currently available on the node where it is scheduled to run". That feels like a nightmare script to write, prone to race conditions (e.g. what if Slurm has scheduled another job on the same node to start at almost the same time?). It may also be impractical (the modified job will probably need to be rescheduled, possibly landing on another node with a different number of idle cores) or impossible (maybe Slurm does not offer the possibility of changing the requested cores after the job has been assigned a node, only at other times, such as submission time).
What is theoretically possible would be to use Slurm only as a "dummy bean counter": submit the job as a 5-core job and let it land and start on a node. The job itself does nothing other than counting the number of idle cores on that node and submitting *another* Slurm job of the highest priority targeting that specific node (option -w) and that number of cores. If the second job starts, then by some other mechanism, probably external to Slurm, the actual computational job will start on the appropriate cores. If that happens outside of Slurm, it would be very hard to get right (with the appropriate cgroup, for example). If that happens inside of Slurm, it needs some functionality which I am not aware exists, but it sounds more likely than "changing the number of cores at the moment the job starts". For example, the two jobs could merge into one. Or the two jobs could stay separate but share some MPI communicator or thread space (and again have trouble with the separate cgroups they live in).
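A very rough sketch of that first "bean counter" job, just to make the idea concrete (untested, still racy, and the follow-up script name is made up):

  #!/bin/bash
  #SBATCH --nodes=1
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=5
  # Count how many CPUs are still idle on the node this job landed on.
  node="$SLURMD_NODENAME"
  total=$(scontrol show node "$node" | grep -o 'CPUTot=[0-9]*'   | cut -d= -f2)
  alloc=$(scontrol show node "$node" | grep -o 'CPUAlloc=[0-9]*' | cut -d= -f2)
  idle=$(( total - alloc ))
  [ "$idle" -ge 1 ] || exit 0   # nothing free beyond what we already hold
  # Resubmit the real job pinned to this node, asking for whatever is free now.
  # (Folding in the 5 CPUs this job itself holds, or merging the two jobs,
  # is exactly the part stock Slurm has no clean mechanism for.)
  sbatch -w "$node" -c "$idle" real-train.batch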
So in conclusion: if this is just a few jobs where you are trying to be more efficient, I think it's better to give up. If this is something really large-scale and important, then my recommendation would be to purchase official Slurm support and get assistance from them.
On Fri, Aug 2, 2024 at 8:37 AM Laura Hild via slurm-users < slurm-users@lists.schedmd.com> wrote:
My read is that Henrique wants to specify a job to require a variable number of CPUs on one node, so that when the job is at the front of the queue, it will run opportunistically on however many happen to be available on a single node as long as there are at least five.
Hello, everyone. I'll answer everyone in a single reply because I've reached a conclusion: I'll give up on the idea of using shared nodes and will require exclusive allocation of a whole node. The final command line will be:
sbatch -N 1 --exclusive --ntasks-per-node=1 --mem=0 pz-train.batch
Thank you everyone for the discussion,
On Mon, Aug 5, 2024 at 5:27 AM Daniel Letai via slurm-users slurm-users@lists.schedmd.com wrote:
I think the issue is more severe than you describe.
Slurm juggles the needs of many jobs. Just because some resources are available at the exact second a job starts doesn't mean those resources are not already earmarked for some future job that is waiting for even more resources. And what about the case where the opportunistic job is a backfill job and, by asking for more resources at the last minute, prevents a higher-priority job from starting or pushes it back?
The request, while understandable from a user's point of view, is a non-starter for a shared cluster.
Just my 2 cents.
On 02/08/2024 17:34, Laura Hild via slurm-users wrote:
My read is that Henrique wants to specify a job to require a variable number of CPUs on one node, so that when the job is at the front of the queue, it will run opportunistically on however many happen to be available on a single node as long as there are at least five.