Hi,
We're running into an issue where slurmctld core-dumps with the
following error. This happens on the backup controller when it needs to
take over from the primary _for a second time_.
slurmctld: fatal: bit_cache_init: cannot change size once set
Has anyone seen this error before? If there are any existing
discussions and/or tickets related to this, please let me know. Our
Slurm version is 24.11.1.
________________
Steps to reproduce:
1. On a healthy cluster, we make the primary controller unavailable.
Since we're running our cluster on cloud VMs, we cause this by
stopping the primary controller VM.
2. From the logs we can see the backup controller take over and log
the message "Running as primary controller".
3. We then start the primary again, making sure the IP addresses and
hostnames stay consistent. Once slurmctld on the primary has started
and taken back control, we can see the log "Running as primary
controller" on that VM.
4. We then stop the primary controller VM again, causing the backup to
try to take control a second time. This time, however, slurmctld on the
backup core-dumps, with the following log entries from journalctl -u
slurmctld:
slurmctld: fatal: bit_cache_init: cannot change size once set
slurmctld.service: Main process exited, code=dumped, status=6/ABRT
slurmctld.service: Failed with result 'core-dump'.
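
For anyone who wants to check whether their backup controller shows the same sequence, here is a minimal sketch in Python that scans slurmctld journal output for the two messages quoted above. The function name summarize_failover and the script name check_failover.py are made up for illustration. The sketch only matches the literal log strings from this report and assumes nothing about Slurm internals.

```python
import sys

# Literal strings taken from the log lines quoted in this report.
TAKEOVER_MSG = "Running as primary controller"
FATAL_MSG = "fatal: bit_cache_init: cannot change size once set"


def summarize_failover(lines):
    """Count takeover messages and note when the fatal error first appears.

    Returns a dict with the total number of takeover messages seen and the
    number of takeovers logged before the first fatal line (None if the
    fatal line never appears).
    """
    takeovers = 0
    fatal_after = None
    for line in lines:
        if TAKEOVER_MSG in line:
            takeovers += 1
        elif FATAL_MSG in line and fatal_after is None:
            fatal_after = takeovers
    return {"takeovers": takeovers, "fatal_after_takeovers": fatal_after}


if __name__ == "__main__":
    # Reads journal text from stdin, one log entry per line.
    print(summarize_failover(sys.stdin))
```

It could be run on the backup controller with something like
journalctl -u slurmctld --no-pager | python3 check_failover.py
to see how many takeovers were logged before the fatal error showed up.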
Thanks!
- Safdar