I'm confused.  Why can't they just use a multi-node job, and have the job script farm out the individual tasks to the various workers through some mechanism (srun, mpirun, ssh, etc.)?  AFAIK, there's nothing preventing a job from using resources on multiple hosts.  The job just needs to have some way of pushing the work out to those hosts.
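
For example, a single batch job could request the aggregate resources and then launch each application as its own job step in the background. A rough, untested sketch (the executable names and the per-step CPU/memory numbers are placeholders for the user's real requirements):

    #!/bin/bash
    #SBATCH --job-name=coupled-sim
    #SBATCH --nodes=4              # enough nodes to hold all 10 apps
    #SBATCH --ntasks=10
    #SBATCH --time=02:00:00

    # Each srun launches one application as a separate job step, with its
    # own CPU/memory request, inside the single allocation.  --exact
    # (Slurm 21.08+) keeps a step from claiming more than it asked for.
    srun --ntasks=1 --cpus-per-task=4 --mem=8G  --exact ./app_01 &
    srun --ntasks=1 --cpus-per-task=2 --mem=4G  --exact ./app_02 &
    # ... repeat for app_03 through app_09 with their own resources ...
    srun --ntasks=1 --cpus-per-task=8 --mem=16G --exact ./app_10 &

    # All steps run inside one allocation, so none of them waits in the
    # queue separately, and they can reach each other over TCP/IP.
    wait

Since it's all one job, the scheduler won't start anything until the full set of resources is available, which gives you the "start at the same time" behavior for free.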

Lloyd


On 7/8/24 14:17, Dan Healy via slurm-users wrote:
Hi there,

I've received a question from an end user to which I presume the answer is "No", but I would like to ask the community first.

Scenario: The user wants to create a series of jobs that all need to start at the same time. Example: there are 10 different executable applications with varying CPU and RAM constraints, all of which need to communicate via TCP/IP. Of course the user could design some kind of idle/status-polling mechanism so each job waits until all of them have started (at whatever times the scheduler dispatches them) before beginning execution, but this feels like a waste of resources. The complete execution of these 10 applications would be considered a single simulation. The goal would be to distribute these 10 applications across the cluster rather than requiring them all to execute on a single node.

Is there a good architecture for this using SLURM? If so, please kindly point me in the right direction.

--
Thanks,

Daniel Healy
-- 
Lloyd Brown
HPC Systems Administrator
Office of Research Computing
Brigham Young University
http://rc.byu.edu