<div dir="ltr"><div><div><div><div>After last night's announcement, I decided to start the upgrade process.<br><br>The build went fine - once I worked out where munge went - and the installation also seemed fine.<br><br>slurmctld won't restart, though.<br><br>In the logs I'm seeing:<br><br>[2018-05-31T15:20:50.810] debug: Munge encode failed: Failed to access "xxxxxxxx": No such file or directory (retrying ...)<br>[2018-05-31T15:20:50.824] debug: Recovered 4 tres<br>[2018-05-31T15:20:50.825] debug: Recovered 3 users<br>[2018-05-31T15:20:50.825] debug: Recovered 0 resources<br>[2018-05-31T15:20:50.825] debug: Recovered 1 qos<br>[2018-05-31T15:20:50.825] debug: Recovered 8 associations<br>[2018-05-31T15:20:50.872] fatal: You are running with a database but for some reason we have less TRES than should be here (4 < 5) and/or the "billing" TRES is missing. This should only happen if the database is down after an upgrade.<br><br>The first issue is that the line<br><br>debug: Munge encode failed: Failed to access "xxxxxx": No such file or directory (retrying ...)<br><br>contains the password in clear text ("xxxxx"). This is doubly confusing: "failed to access" suggests the message was meant to show the database name (StorageLoc) rather than the database password (StoragePass). If it really is meant to use the password, I don't think it should be logged in clear text, and (in my mind) the wording should be clearer.<br><br></div>The second issue is that slurmctld.service won't start. The last error shown above<br><br>fatal: You are running with a database but for some reason we have less
TRES than should be here (4 < 5) and/or the "billing" TRES is
missing. This should only happen if the database is down after an
upgrade.<br><br></div>has a couple of hits on Google: an unanswered email from January <br><a href="https://groups.google.com/d/msg/slurm-users/iZsSVlqQAyE/rKiSWihyEQAJ">https://groups.google.com/d/msg/slurm-users/iZsSVlqQAyE/rKiSWihyEQAJ</a><br><br></div>and a bug report<br><a href="https://bugs.schedmd.com/show_bug.cgi?id=4579">https://bugs.schedmd.com/show_bug.cgi?id=4579</a><br><br></div>which seems to have solved a similar but slightly different problem. The fix suggested in that bug report doesn't work for me: with MariaDB_server 5.2.x, my tres_table didn't have gres in it anyway.<br><br>+---------------+---------+------+----------------+------+<br>| creation_time | deleted | id | type | name |<br>+---------------+---------+------+----------------+------+<br>| 1527744028 | 0 | 1 | cpu | |<br>| 1527744028 | 0 | 2 | mem | |<br>| 1527744028 | 0 | 3 | energy | |<br>| 1527744028 | 0 | 4 | node | |<br>| 1527744028 | 0 | 5 | billing | |<br>| 1527744028 | 1 | 1000 | dynamic_offset | |<br>+---------------+---------+------+----------------+------+<br><div><div><div><div><div><br><br></div><div>I have no idea what to try next. Any hints would be appreciated.<br><br></div><div>I'm running on CentOS 7.5 and upgrading from 17.02.8. I also dropped the slurmdbd database and restarted it from empty when the fix from the bug report didn't work.<br></div><div><br></div><div>L.<br></div><div><br>------<br>"The antidote to apocalypticism is apocalyptic civics. Apocalyptic civics is the insistence that we cannot ignore the truth, nor should we panic about it. It is a shared consciousness that our institutions have failed and our ecosystem is collapsing, yet we are still here — and we are creative agents who can shape our destinies. Apocalyptic civics is the conviction that the only way out is through, and the only way through is together. 
"<br><br>Greg Bloom @greggish <a href="https://twitter.com/greggish/status/873177525903609857">https://twitter.com/greggish/status/873177525903609857</a></div></div></div></div></div></div>