[slurm-users] Issues with HA config and AllocNodes

Dave Sizer dsizer at nvidia.com
Tue Dec 17 17:23:34 UTC 2019


Hello friends,

We are running Slurm 19.05.1-2 with an HA setup consisting of one primary and one backup controller.  However, we are observing that when the backup takes over, AllocNodes is for some reason set to "none" on all of our partitions.  We can work around this by manually setting AllocNodes=ALL on each partition, but that is not feasible in production: any jobs submitted just before the takeover still fail before the partitions can be updated by hand.  For reference, the backup controller comes up with the correct config if it is restarted AFTER the primary is taken down, so this issue seems isolated to the takeover flow.
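
For context, here is roughly what our setup and manual workaround look like.  The partition and host names below are placeholders rather than our real ones:

    # Relevant slurm.conf excerpt (19.05 style: multiple SlurmctldHost
    # lines, first entry is the primary, second the backup):
    #   SlurmctldHost=ctl-primary
    #   SlurmctldHost=ctl-backup
    #   StateSaveLocation=/shared/slurm/state   # shared by both controllers

    # Manual workaround after a takeover: re-open each affected partition
    scontrol update PartitionName=gpu AllocNodes=ALL

    # Verify the partition state afterwards
    scontrol show partition gpu | grep AllocNodes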

Has anyone seen this issue before?  Or any hints for how I can debug this problem?

Thanks in advance!

Dave
