On 26/8/24 8:40 am, Di Bernardini, Fabio via slurm-users wrote:
Hi everyone, for accounting reasons, I need to create only one job across two or more federated clusters with two or more srun steps.
The limitations for heterogeneous jobs say:
https://slurm.schedmd.com/heterogeneous_jobs.html#limitations
In a federation of clusters, a heterogeneous job will execute entirely on the cluster from which the job is submitted. The heterogeneous job will not be eligible to migrate between clusters or to have different components of the job execute on different clusters in the federation.
However, from your script it's not clear to me that this is what you mean, because you include multiple --cluster options. I'm not sure whether that works; as you mention, the docs don't cover that case. They do, however, say:
If a heterogeneous job is submitted to run in multiple clusters not part of a federation (e.g. "sbatch --cluster=alpha,beta ...") then the entire job will be sent to the cluster expected to be able to start all components at the earliest time.
My gut instinct is that this isn't going to work. My feeling is that launching a heterogeneous job like this would require the slurmctld daemons on each cluster to coordinate, and I'm not aware of that being possible currently.
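For reference, here's a minimal sketch of the case the docs do describe: a two-component heterogeneous job with one srun step per component, offered to two clusters. The cluster names alpha and beta come from the docs example; the job name, task counts and the hostname commands are placeholders I made up, and I haven't tested this against a federation. It writes the batch script and only prints the sbatch command rather than submitting anything.

```shell
#!/bin/bash
# Write a two-component heterogeneous batch script.
# Job name, task counts and commands are illustrative placeholders.
cat > hetjob.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=het-demo
#SBATCH --ntasks=1
#SBATCH hetjob
#SBATCH --ntasks=4

# One step per component of the heterogeneous job.
srun --het-group=0 hostname
srun --het-group=1 hostname
EOF

# Print the submission command instead of running it.
# alpha and beta are the example cluster names from the docs.
echo "sbatch --clusters=alpha,beta hetjob.sh"
```

Going by the docs quoted above, I'd expect the whole job to be sent to one of the two clusters rather than split between them.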
All the best, Chris