Hi all,
We have a problem on slurm-23.02.4-1 with 10.6.16-MariaDB. MariaDB and slurmctld run active/active, slurmdbd runs active/off, all behind a shared IP. The spool is shared via Gluster. The database comes from a Slurm installation dating from around 2017 and has been upgraded several times since. The question is whether we should give up and start from scratch, or whether there is an easy fix.
Problem: whenever we add a new user with sacctmgr, the user shows up properly in sacct/sacctmgr, but does not show up in sshare, even after running some jobs. Only after restarting Slurm a couple of times does the user appear. The problem seems to have been present in the previous version as well.
The only error we can see in the slurmdbd log:
[2023-12-21T09:43:30.586] error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to 10.141.255.253:6817 : (null)
[2023-12-21T09:43:30.586] error: slurmdb_send_accounting_update_persist: Unable to open connection to registered cluster cluster.
[2023-12-21T09:43:30.586] error: slurm_receive_msg: No response to persist_init
[2023-12-21T09:43:30.586] error: update cluster: No error to cluster at 10.141.255.253(6817)
[2023-12-21T09:43:30.586] debug2: DBD_FINI: CLOSE:1 COMMIT:0
[2023-12-21T09:43:30.586] debug4: accounting_storage/as_mysql: acct_storage_p_commit: got 0 commits
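In case it helps anyone compare against their own logs: the errors above all point at one address, 10.141.255.253:6817, which slurmdbd could not open a persistent connection to. Below is a minimal Python sketch that pulls the failing host/port pairs out of such log lines. The helper name failed_targets and the two regular expressions are made up for this illustration and only cover the two address spellings seen in this excerpt (host:port and host(port)); this is not a Slurm tool.

```python
import re

# Error lines copied from the slurmdbd log excerpt above.
LOG = """\
[2023-12-21T09:43:30.586] error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to 10.141.255.253:6817 : (null)
[2023-12-21T09:43:30.586] error: slurmdb_send_accounting_update_persist: Unable to open connection to registered cluster cluster.
[2023-12-21T09:43:30.586] error: slurm_receive_msg: No response to persist_init
[2023-12-21T09:43:30.586] error: update cluster: No error to cluster at 10.141.255.253(6817)
[2023-12-21T09:43:30.586] debug2: DBD_FINI: CLOSE:1 COMMIT:0
"""

# "[timestamp] level: message"
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>\w+):\s+(?P<msg>.*)$")
# IPv4 address followed by a port, written as host:port or host(port)
ADDR_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})[:(](\d+)\)?")


def failed_targets(log_text):
    """Return the sorted, de-duplicated (host, port) pairs named in error lines."""
    targets = set()
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m or m.group("level") != "error":
            continue  # skip debug lines and anything that is not a log entry
        for host, port in ADDR_RE.findall(m.group("msg")):
            targets.add((host, int(port)))
    return sorted(targets)


if __name__ == "__main__":
    for host, port in failed_targets(LOG):
        print(f"{host}:{port}")
```

For this excerpt the only pair reported is the shared address and port quoted in the errors, so every failed update was aimed at the same endpoint.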
AccountingStorageType=accounting_storage/slurmdbd
# jobaccounting
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldTimeout=60
SlurmdTimeout=60
TCPTimeout=60
MessageTimeout=60
Best regards,
Alex