Hi there,
I've received a question from an end user, to which I presume the answer is "no", but I would like to ask the community first.
Scenario: The user wants to create a series of jobs that all need to start at the same time. Example: there are 10 different executable applications, each with its own CPU and RAM requirements, all of which need to communicate via TCP/IP. Of course the user could design some kind of idle/statusing mechanism to wait until all jobs have randomly started and only then begin execution, but this feels like a waste of resources. The complete execution of these 10 applications would be considered a single simulation. The goal is to distribute these 10 applications across the cluster, not necessarily requiring them all to execute on a single node.
Is there a good architecture for this using SLURM? If so, please kindly point me in the right direction.
I think the best way to do it would be to schedule the 10 applications as a single Slurm job and then use one of the various MPMD approaches (the nitty-gritty details depend on whether each executable is serial, OpenMP, MPI, or hybrid). For the serial case, something like the sketch below is one option.
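As an illustration, assuming each of the 10 executables is serial, a minimal (untested) sketch using srun's --multi-prog mode could look like this. The file name sim.conf and the application names are placeholders:

#!/bin/bash
#SBATCH --job-name=simulation
#SBATCH --ntasks=10           # one task per application
#SBATCH --cpus-per-task=4     # sized for the largest app; --multi-prog gives every task the same limits
#SBATCH --mem-per-cpu=2G

# All 10 tasks start together as one job step; Slurm may spread them
# across nodes, and they can find each other by hostname for TCP/IP.
srun --multi-prog sim.conf

with sim.conf mapping task ranks to executables:

0     ./app_controller
1-9   ./app_worker

Note that --multi-prog gives every task identical per-task resources, so if the applications' CPU/RAM needs differ a lot, that is a limitation of this approach.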
On Mon, Jul 8, 2024 at 2:20 PM Dan Healy via slurm-users < slurm-users@lists.schedmd.com> wrote:
Dan,
The requirement for varying CPU and RAM per application sounds like it could be met with Slurm's Heterogeneous Jobs feature (https://slurm.schedmd.com/heterogeneous_jobs.html). Take a look at that document and see if it meets your needs.
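As a rough, untested sketch of what that might look like for two of the components (the application names and resource numbers here are made up; add one "hetjob" block per additional application):

#!/bin/bash
#SBATCH --job-name=simulation
#SBATCH --ntasks=1 --cpus-per-task=8 --mem=32G   # component 0: the CPU-heavy app

#SBATCH hetjob
#SBATCH --ntasks=1 --cpus-per-task=2 --mem=4G    # component 1: a lighter app

# A single srun starts all components at the same time, one executable
# per het group; components may land on different nodes and
# communicate over TCP/IP.
srun ./heavy_app : ./light_app

All components of a heterogeneous job are allocated and started together, which is exactly the "start at the same time" behavior you describe, while letting each component request its own CPU and memory.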
Mike Robbert
Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research Computing
Information and Technology Solutions (ITS)
303-273-3786 | mrobbert@mines.edu
On 7/8/24, 14:20, "Dan Healy via slurm-users" <slurm-users@lists.schedmd.com> wrote: