[slurm-users] Can't get fed job lock from origin cluster to backfill job

Caubet Serrabou Marc (PSI) marc.caubet at psi.ch
Tue Jul 2 12:52:08 UTC 2019


Hi,

I have a federation of two clusters, 'merlin5' and 'merlin6'. For some reason, two jobs are stuck in a strange state: one is pending with reason FedJobLock and the other with reason Priority. The jobs never get allocated, and I am not able to cancel them:

         134225868       gpu     bash bliven_s PD       0:00      1 (FedJobLock)
         134225867       gpu     bash bliven_s PD       0:00      1 (Priority)

I tried to cancel the jobs, without success. In the Slurm server logs I see the following:

Merlin5:

[2019-07-02T14:20:14.252] backfill test for JobId=134225868 Prio=3559 Partition=gpu
[2019-07-02T14:20:14.293] backfill: JobId=134225868 can't get fed job lock from origin cluster to backfill job
[2019-07-02T14:20:14.293] backfill: planned start of JobId=134225868 failed: Job locked by another sibling
[2019-07-02T14:20:14.293] JobId=134225868 to start at 2019-07-02T14:20:14, end at 2019-07-07T14:20:00 on nodes merlin-g-01 in partition gpu
[2019-07-02T14:20:14.294] backfill test for JobId=134225867 Prio=3559 Partition=gpu
[2019-07-02T14:20:14.374] backfill: JobId=134225867 can't get fed job lock from origin cluster to backfill job
[2019-07-02T14:20:14.374] backfill: planned start of JobId=134225867 failed: Job locked by another sibling
[2019-07-02T14:20:14.374] JobId=134225867 to start at 2019-07-02T14:20:14, end at 2019-07-07T14:20:00 on nodes merlin-g-04 in partition gpu
[2019-07-02T14:20:14.374] backfill: reached end of job queue
[2019-07-02T14:20:14.374] backfill: completed testing 2(2) jobs, usec=122038
[2019-07-02T14:20:18.052] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=134225868 uid 0 routed to merlin6
[2019-07-02T14:20:18.052] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=134225867 uid 0 routed to merlin6

Merlin6:

[2019-07-02T14:20:21.755] backfill: beginning
[2019-07-02T14:20:21.756] backfill: no jobs to backfill
[2019-07-02T14:20:44.415] error: Didn't find JobId=134225868 in fed_job_list
[2019-07-02T14:20:44.456] error: Didn't find JobId=134225867 in fed_job_list
[2019-07-02T14:20:51.756] backfill: beginning
[2019-07-02T14:20:51.756] backfill: no jobs to backfill
[2019-07-02T14:21:09.721] error: Didn't find JobId=134225868 in fed_job_list
[2019-07-02T14:21:14.537] error: Didn't find JobId=134225868 in fed_job_list
[2019-07-02T14:21:14.578] error: Didn't find JobId=134225867 in fed_job_list
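
To make the mismatch between the two logs explicit, here is a minimal Python sketch that cross-references them. It lists the jobs that one cluster fails to lock and that the other cluster reports as missing from its fed_job_list. It assumes merlin6 is the origin cluster, as the "routed to merlin6" lines suggest. The function names and regular expressions are illustrative only, derived from the log lines quoted above, and are not part of Slurm.

```python
import re

# Patterns derived from the log lines quoted above; they are assumptions
# about the message format, not a Slurm API.
LOCK_FAIL = re.compile(r"backfill: JobId=(\d+) can't get fed job lock")
NOT_FOUND = re.compile(r"error: Didn't find JobId=(\d+) in fed_job_list")


def job_ids(pattern, lines):
    """Return the set of job IDs that `pattern` matches in the log lines."""
    ids = set()
    for line in lines:
        match = pattern.search(line)
        if match:
            ids.add(int(match.group(1)))
    return ids


def orphaned_jobs(sibling_log, origin_log):
    """Jobs the sibling cannot lock and the origin does not find in its list."""
    return sorted(job_ids(LOCK_FAIL, sibling_log) & job_ids(NOT_FOUND, origin_log))


# Excerpts copied from the merlin5 and merlin6 logs above.
merlin5_log = [
    "[2019-07-02T14:20:14.293] backfill: JobId=134225868 can't get fed job lock from origin cluster to backfill job",
    "[2019-07-02T14:20:14.374] backfill: JobId=134225867 can't get fed job lock from origin cluster to backfill job",
]
merlin6_log = [
    "[2019-07-02T14:20:44.415] error: Didn't find JobId=134225868 in fed_job_list",
    "[2019-07-02T14:20:44.456] error: Didn't find JobId=134225867 in fed_job_list",
]

print(orphaned_jobs(merlin5_log, merlin6_log))
```

Both job IDs appear in the result, which matches what the logs show: merlin5 keeps asking for a lock on jobs that merlin6 does not find in its fed_job_list.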

Meanwhile, the accounting server shows:

           bliven_s 134225867          bash        gpu    PENDING Partition+             Unknown             Unknown   00:00:00                              1          1   00:00:00                  None assigned    merlin5
           bliven_s 134225868          bash        gpu    PENDING Partition+             Unknown             Unknown   00:00:00                              1          1   00:00:00                  None assigned    merlin5


Any idea how to fix this, and what could have triggered it?

Thanks a lot,
Marc
_________________________________________________________
Paul Scherrer Institut
High Performance Computing & Emerging Technologies
Marc Caubet Serrabou
Building/Room: OHSA/014
Forschungsstrasse, 111
5232 Villigen PSI
Switzerland

Telephone: +41 56 310 46 67
E-Mail: marc.caubet at psi.ch