[slurm-users] Issues with HA config and AllocNodes

Dave Sizer dsizer at nvidia.com
Thu Dec 19 17:44:53 UTC 2019


So I’ve found some more info on this. It seems like the primary controller is writing “none” as the AllocNodes value in the partition state file when it shuts down. It does this even with the backup out of the picture, and it still happens when I swap the primary and backup controller nodes in the config.

When the primary starts up, it ignores these none values and sets AllocNodes=ALL on all partitions (what we want), but when the backup starts up, it “honors” the none values and all partitions end up with AllocNodes=none. Again, the slurm.conf files on both nodes are the same, and this happens even when swapping the primary/backup roles of the nodes. I am digging through the source to try to find some hints.
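In case it helps anyone reproduce this, here is roughly how I am checking the values after each restart/takeover (just a quick sketch; it only queries whichever controller is currently active):

# Print each partition name together with its current AllocNodes setting
scontrol show partition -o | tr ' ' '\n' | grep -E '^(PartitionName|AllocNodes)='
# After the primary starts we see AllocNodes=ALL everywhere; after the backup
# takes over we see AllocNodes=none everywhere.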

Does anyone have any ideas?

From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Dave Sizer <dsizer at nvidia.com>
Reply-To: Slurm User Community List <slurm-users at lists.schedmd.com>
Date: Tuesday, December 17, 2019 at 1:05 PM
To: Brian Andrus <toomuchit at gmail.com>, "slurm-users at schedmd.com" <slurm-users at schedmd.com>
Subject: Re: [slurm-users] Issues with HA config and AllocNodes

Thanks for the response.

I have confirmed that the slurm.conf files are the same and that StateSaveLocation is working; we see logs like the following on the backup controller:
Recovered state of 9 partitions
Recovered JobId=124 Assoc=6
Recovered JobId=125 Assoc=6
Recovered JobId=126 Assoc=6
Recovered JobId=127 Assoc=6
Recovered JobId=128 Assoc=6
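
For completeness, this is the kind of check I ran (the hostnames and config path below are just how our setup looks; adjust as needed):

# Compare the config on both controllers
ssh slurm-ctrl-01 md5sum /etc/slurm/slurm.conf
ssh slurm-ctrl-02 md5sum /etc/slurm/slurm.conf

# Confirm both point at the same shared StateSaveLocation
ssh slurm-ctrl-01 grep -i StateSaveLocation /etc/slurm/slurm.conf
ssh slurm-ctrl-02 grep -i StateSaveLocation /etc/slurm/slurm.conf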

I do see the following error when the backup takes control, but I am not sure whether it is related, since the controller continues to start up fine:

error: _shutdown_bu_thread:send/recv slurm-ctrl-02: Connection refused

We also see a lot of these messages on the backup while it is in standby mode, but from what I’ve researched these may be unrelated as well:

error: Invalid RPC received 1002 while in standby mode

and similar messages with other RPC codes. We no longer see these once the backup controller has taken control.

I do agree that there is some issue with the saving/loading of partition state during takeover; I’m just a bit stumped as to why it is happening and how to stop partitions from being loaded with AllocNodes=none.
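
If it is useful for comparison, the other thing I have been watching is the partition state file itself under StateSaveLocation (a rough sketch; the part_state name and the directory below are from our setup):

# See which controller last rewrote the partition state, and when
STATE_DIR=/var/spool/slurm/state   # whatever StateSaveLocation points at
ls -l "$STATE_DIR"/part_state*

# The file is packed binary, but strings is enough to spot a literal "none"
strings "$STATE_DIR"/part_state | grep -i none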



From: Brian Andrus <toomuchit at gmail.com>
Date: Tuesday, December 17, 2019 at 12:30 PM
To: Dave Sizer <dsizer at nvidia.com>
Subject: Re: [slurm-users] Issues with HA config and AllocNodes


Double check that your slurm.conf files are the same and that both systems are successfully using your state save directory (StateSaveLocation).

Brian Andrus
On 12/17/2019 9:23 AM, Dave Sizer wrote:
Hello friends,

We are running Slurm 19.05.1-2 with an HA setup consisting of one primary and one backup controller.  However, we are observing that when the backup takes over, for some reason AllocNodes is getting set to “none” on all of our partitions.  We can remedy this by manually setting AllocNodes=ALL on each partition (rough sketch below); however, this is not feasible in production, since any jobs launched just before the takeover still fail to submit (before the partitions can be manually updated).  For reference, the backup controller has the correct config if it is restarted AFTER the primary is taken down, so this issue seems isolated to the takeover flow.
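
The manual fix we apply today is just a loop over the partitions, roughly like this (a sketch only, run against whichever controller is active after the takeover):

# Reset AllocNodes on every partition after the backup takes over
for part in $(sinfo -h -o '%R'); do
    scontrol update PartitionName="$part" AllocNodes=ALL
done

but as noted, any jobs submitted in the window before this runs are still rejected.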

Has anyone seen this issue before?  Or any hints for how I can debug this problem?

Thanks in advance!

Dave