[slurm-users] 20.11.8: Altered federation code ? "siblings not synced yet" messages

Kevin Buckley Kevin.Buckley at pawsey.org.au
Mon Jul 5 06:21:30 UTC 2021

On 2021/07/05 11:39, Kevin Buckley wrote:
> Upgraded our Cray TDS from 20.11.7 to 20.11.8, without making any
> changes to the configuration, but am now not seeing jobs start to
> run, whilst seeing messages in the slurmd log akin to these four
>    Submitted federated JobId=67122494 to tdsname(self)
>    _slurm_rpc_submit_batch_job: JobId=67122494 InitPrio=0 usec=8208
>    sched: schedule() returning, federation siblings not synced yet
>    sched/backfill: _attempt_backfill: returning, federation siblings not synced yet
> none of which were in evidence prior to the upgrade.
> Didn't see anything in the 20.11.8 changes that suggested anything
> to do with "federation" had been introduced, though yet to trawl
> through the code.
> Anyone seen similar?
> Kevin

Starting to look as though something federation-related may have been
"fixed" in 20.11.8, or "unfixed" for combinations of federations of
differing Slurm versions?

If I leave the Cray TDS cluster in a federation of one - it had
previously been operating within a federation of two, with a non-Cray
TDS cluster - then jobs start to run within it again.
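
For anyone wanting to check whether their own logs show the same
pattern before and after changing the federation membership, here is a
minimal sketch of a log scan. It only assumes the message text quoted
above; the function name summarize and the in-line sample are
illustrative, not anything shipped with Slurm.

```python
import re

# Message text taken verbatim from the log lines quoted above.
NOT_SYNCED = "federation siblings not synced yet"
JOB_ID = re.compile(r"JobId=(\d+)")


def summarize(lines):
    """Count 'not synced' scheduler messages and collect the IDs of
    jobs reported as submitted to the federation."""
    not_synced = 0
    job_ids = []
    for line in lines:
        if NOT_SYNCED in line:
            not_synced += 1
        elif "Submitted federated" in line:
            match = JOB_ID.search(line)
            if match:
                job_ids.append(int(match.group(1)))
    return not_synced, job_ids


# Sample input: the four log lines from the original report.
sample = [
    "Submitted federated JobId=67122494 to tdsname(self)",
    "_slurm_rpc_submit_batch_job: JobId=67122494 InitPrio=0 usec=8208",
    "sched: schedule() returning, federation siblings not synced yet",
    "sched/backfill: _attempt_backfill: returning, federation siblings not synced yet",
]

print(summarize(sample))
```

To run it against a real log, pass an open file object instead of the
sample list; a count of the "not synced" messages that stops growing
once the cluster is alone in its federation would match what is
described above.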

Supercomputing Systems Administrator
Pawsey Supercomputing Centre
