[slurm-users] Should I join the federation?

Vicker, Darby (JSC-EG311) darby.vicker-1 at nasa.gov
Mon Feb 12 07:47:21 MST 2018


We recently brought a new cluster online with the intention of federating it with our existing cluster.  See the full story here:

https://bugs.schedmd.com/show_bug.cgi?id=4512

There are some fairly large limitations to federation, the biggest of which (for us anyway) was:

> The current implementation assumes all systems in the federation 
> are largely identical. We hope to address this in future versions.

I initially thought this would be a showstopper for us, but we were able to modify our job_submit.lua to work around that issue for our use case.  We haven't actually federated our two clusters yet; we are still testing things out with -M submissions.  If that works out, we probably will federate them.
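For anyone who hasn't used -M submissions: sbatch accepts -M/--clusters with a comma-separated list of cluster names. Here is a minimal Python sketch of building such a command line. The helper name sbatch_argv and the cluster names are made up for illustration; this is not part of our actual setup.

```python
from typing import List, Sequence


def sbatch_argv(script: str, clusters: Sequence[str]) -> List[str]:
    """Build an sbatch argument list that targets the given clusters.

    `clusters` holds cluster names as they appear in the Slurm accounting
    database; the names used in the example below are hypothetical.
    """
    if not clusters:
        raise ValueError("at least one cluster name is required")
    # --clusters is the long form of -M and takes a comma-separated list.
    return ["sbatch", "--clusters=" + ",".join(clusters), script]


if __name__ == "__main__":
    # Print the command instead of running it, so no Slurm install is needed.
    print(" ".join(sbatch_argv("job.sh", ["alpha", "beta"])))
```

The resulting list can be handed to subprocess.run() on a host where sbatch is installed.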

My initial solution for getting Slurm working on multiple clusters also involved setting SLURM_CONF to a different location (we have Slurm installed on an NFS share that is mounted on both clusters).  As pointed out in Bug 4573, this isn't a good solution for multi-cluster operation because, by default anyway, environment variables are exported along with the job, including to another cluster.  A job that starts on another cluster then carries a SLURM_CONF value set on the submitting side, which will confuse Slurm.  The solution I chose was to configure Slurm to always look in /etc/slurm/ and make that directory a symlink to the proper Slurm configuration directory for that cluster.  That seems to work well for us.
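The symlink approach can be sketched roughly as follows. This is only an illustration in Python, not the script we actually use: the function name point_config_link and the directory layout in the demo are assumptions, and on a real node the link would be /etc/slurm, which needs root to modify.

```python
import os
import tempfile
from pathlib import Path


def point_config_link(link: Path, target_dir: Path) -> None:
    """Make `link` a symlink to `target_dir`, replacing any older symlink."""
    if not target_dir.is_dir():
        raise FileNotFoundError(f"no such config directory: {target_dir}")
    if link.exists() and not link.is_symlink():
        # Refuse to clobber a real directory or file.
        raise FileExistsError(f"{link} exists and is not a symlink")
    # Create the new link under a temporary name, then rename it over the
    # old one so the path is never missing in between.
    tmp = link.with_name(link.name + ".tmp")
    if tmp.is_symlink():
        tmp.unlink()
    os.symlink(target_dir, tmp, target_is_directory=True)
    os.replace(tmp, link)


if __name__ == "__main__":
    # Demo in a scratch directory so it runs without root. The "nfs" tree
    # stands in for a shared install with one config directory per cluster
    # (hypothetical layout and cluster names).
    root = Path(tempfile.mkdtemp())
    for name in ("cluster-a", "cluster-b"):
        (root / "nfs" / name).mkdir(parents=True)
    link = root / "etc-slurm"
    point_config_link(link, root / "nfs" / "cluster-a")
    print(link, "->", os.readlink(link))
```

Each node would run something like this once at provisioning time with its own cluster's directory as the target, so that every cluster resolves the same /etc/slurm path and nothing cluster-specific has to live in the user's environment.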

Our plugins are pretty much the same between our two clusters, so I can't really speak to the plugin question.

Hope that helps. 

Darby




-----Original Message-----
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Yair Yarom <irush at cs.huji.ac.il>
Reply-To: Slurm User Community List <slurm-users at lists.schedmd.com>
Date: Monday, February 12, 2018 at 4:15 AM
To: "slurm-users at schedmd.com" <slurm-users at schedmd.com>
Subject: [slurm-users] Should I join the federation?


Hi all,

I was wondering if any of you can share your insights regarding
federations. What unexpected caveats have you encountered?

We have here about 15 "small" clusters (due to political and
technical reasons), and most users have access to more than one
cluster. Federation seems like a good solution instead of users running
between clusters searching for available resources (we'll probably have
2-4 federations...).

I would also want to have a single submission node, but then users will
still need to select a cluster (we have an lmod module to select a
cluster by setting PATH and SLURM_CONF). The solution I've come up with is to
create a dummy cluster with a lot of drained resources. But this seems
like a not-so-good solution and might confuse users with always pending
jobs, and will not work with array jobs.

Also, is there a way to set things up so that by default jobs will be submitted
to the current cluster instead of the federation (i.e. -M <cluster> by
default)? I guess this can be done by a plugin (can it? or does it run
after the sibling submissions?), but I was wondering if there's already
a solution.

Last question :), are there any issues with plugins? i.e. we have
different plugins for different clusters, if they change some of the job
parameters, should I be worried about plugins from the origin
cluster or from the sibling cluster? Will the job have several plugins
from several clusters activated on it?

Thanks in advance for any advice,
    Yair.




