Hi everyone, for accounting reasons, I need to create only one job across two or more federated clusters with two or more srun steps. I'm trying with hetjobs but it's not clear to me from the documentation (https://slurm.schedmd.com/heterogeneous_jobs.html) if this is possible and how to do it. I'm trying with this script, but the steps are executed on only the first cluster. Can you tell me if there is a mistake in the hetjob or if it has to be done in another way?
#!/bin/bash
#SBATCH hetjob
#SBATCH --clusters=cluster2
srun -v --het-group=0 hostname
#SBATCH hetjob
#SBATCH --clusters=cluster3
srun -v --het-group=1 hostname
NICE SRL, viale Monte Grappa 3/5, 20124 Milano, Italia, Registro delle Imprese di Milano Monza Brianza Lodi REA n. 2096882, Capitale Sociale: 10.329,14 EUR i.v., Cod. Fisc. e P.IVA 01133050052, Societa con Socio Unico
Ciao Fabio,
That is certainly syntactically incorrect because of the way sbatch parsing works: as soon as it finds a non-empty, non-comment line (your first srun), it stops looking for #SBATCH directives. So, assuming this is a single file, as the formatting suggests, the second hetjob and the cluster3 directives are ignored. If these are two separate files, they would be two separate jobs, so that's not going to work either.
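To make the parsing point concrete, here is a minimal sketch that emulates the rule described above in plain bash. It is not sbatch's real parser: the `list_directives` helper is hypothetical and exists only for illustration. It shows which #SBATCH lines would be seen in the script as posted, and in a regrouped layout where every directive comes before the first command.

```shell
#!/bin/bash
# Rough emulation of the rule described above; NOT sbatch's actual parser.
# list_directives is a hypothetical helper: it prints the #SBATCH lines
# that appear before the first command line read from stdin.
list_directives() {
    local line
    while IFS= read -r line; do
        case "$line" in
            '#SBATCH'*) printf '%s\n' "$line" ;;  # directive: keep it
            '#'*|'')    ;;                        # other comment or empty line: skip
            *)          break ;;                  # first command: stop scanning
        esac
    done
}

# The script as posted: directives interleaved with srun commands.
original='#!/bin/bash
#SBATCH hetjob
#SBATCH --clusters=cluster2
srun -v --het-group=0 hostname
#SBATCH hetjob
#SBATCH --clusters=cluster3
srun -v --het-group=1 hostname'

# Same lines, with all directives grouped before the first command.
# This only addresses the parsing problem. It is an assumption, not a
# confirmed fact, that --clusters is honoured per component, and grouping
# does not by itself make the components run on different clusters.
regrouped='#!/bin/bash
#SBATCH hetjob
#SBATCH --clusters=cluster2
#SBATCH hetjob
#SBATCH --clusters=cluster3
srun -v --het-group=0 hostname
srun -v --het-group=1 hostname'

seen_original=$(printf '%s\n' "$original" | list_directives)
seen_regrouped=$(printf '%s\n' "$regrouped" | list_directives)

echo "Directives seen in the original layout:"
printf '%s\n' "$seen_original"
echo "Directives seen in the regrouped layout:"
printf '%s\n' "$seen_regrouped"
```

Under this emulation, the original layout yields only the first hetjob and cluster2 lines, while the regrouped layout yields all four directives.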
More specifically to your question, I can't help because I don't have experience with federated clusters.
On Mon, Aug 26, 2024 at 9:43 AM Di Bernardini, Fabio via slurm-users < slurm-users@lists.schedmd.com> wrote:
Hi everyone, for accounting reasons, I need to create only one job across two or more federated clusters with two or more srun steps.
I’m trying with hetjobs but it's not clear to me from the documentation ( https://slurm.schedmd.com/heterogeneous_jobs.html) if this is possible and how to do it.
I'm trying with this script, but the steps are executed on only the first cluster.
Can you tell me if there is a mistake in the hetjob or if it has to be done in another way?
#!/bin/bash
#SBATCH hetjob
#SBATCH --clusters=cluster2
srun -v --het-group=0 hostname
#SBATCH hetjob
#SBATCH --clusters=cluster3
srun -v --het-group=1 hostname
On 26/8/24 8:40 am, Di Bernardini, Fabio via slurm-users wrote:
Hi everyone, for accounting reasons, I need to create only one job across two or more federated clusters with two or more srun steps.
The limitations for heterogeneous jobs say:
https://slurm.schedmd.com/heterogeneous_jobs.html#limitations
In a federation of clusters, a heterogeneous job will execute entirely on the cluster from which the job is submitted. The heterogeneous job will not be eligible to migrate between clusters or to have different components of the job execute on different clusters in the federation.
However, from your script it's not clear to me that this is what you mean, because you include multiple --clusters options. I'm not sure whether that works; as you mention, the docs don't cover that case. They do say, however, that:
If a heterogeneous job is submitted to run in multiple clusters not part of a federation (e.g. "sbatch --cluster=alpha,beta ...") then the entire job will be sent to the cluster expected to be able to start all components at the earliest time.
My gut instinct is that this isn't going to work. My feeling is that launching a heterogeneous job like this would require the slurmctld daemons on each cluster to coordinate, and I'm not aware of that being possible currently.
All the best, Chris