[slurm-users] Upgrade woes

Lachlan Musicman datakid at gmail.com
Wed May 30 23:39:41 MDT 2018


After last night's announcement, I decided to start the upgrade process.

Build went fine - once I worked out where munge went - and installation
also seemed fine.

slurmctld won't restart though.

In the logs I'm seeing:

[2018-05-31T15:20:50.810] debug:  Munge encode failed: Failed to access
"xxxxxxxx": No such file or directory (retrying ...)
[2018-05-31T15:20:50.824] debug:  Recovered 4 tres
[2018-05-31T15:20:50.825] debug:  Recovered 3 users
[2018-05-31T15:20:50.825] debug:  Recovered 0 resources
[2018-05-31T15:20:50.825] debug:  Recovered 1 qos
[2018-05-31T15:20:50.825] debug:  Recovered 8 associations
[2018-05-31T15:20:50.872] fatal: You are running with a database but for
some reason we have less TRES than should be here (4 < 5) and/or the
"billing" TRES is missing. This should only happen if the database is down
after an upgrade.
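
As a quick cross-check, the "4" in the fatal message appears to be the same number as the "Recovered 4 tres" debug line. A minimal sketch that pulls both numbers out of a log excerpt is below; the `LOG` string is the output above with the mail line-wrapping undone, and `tres_counts` is a made-up helper name, not anything provided by Slurm.

```python
import re

# Log excerpt from above, unwrapped by hand (the wrapping came from the mail client).
LOG = """\
[2018-05-31T15:20:50.824] debug:  Recovered 4 tres
[2018-05-31T15:20:50.825] debug:  Recovered 3 users
[2018-05-31T15:20:50.872] fatal: You are running with a database but for some reason we have less TRES than should be here (4 < 5) and/or the "billing" TRES is missing. This should only happen if the database is down after an upgrade.
"""

def tres_counts(log_text):
    """Return (recovered, (have, want)) parsed from the log, or None where absent."""
    recovered = re.search(r"Recovered (\d+) tres", log_text)
    fatal = re.search(r"fatal: .*?\((\d+) < (\d+)\)", log_text, re.S)
    return (
        int(recovered.group(1)) if recovered else None,
        tuple(int(g) for g in fatal.groups()) if fatal else None,
    )

print(tres_counts(LOG))
```

If the two "have" numbers agree, the fatal check is presumably working from whatever slurmctld recovered at startup, though that is an inference from the log and not something I have confirmed in the source.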

The first issue is that

debug:  Munge encode failed: Failed to access "xxxxxx": No such file or
directory (retrying ...)

contains the password in clear text ("xxxxx"). This is doubly confusing:
"Failed to access" suggests the message was meant to show the database name
(StorageLoc) rather than the database password (StoragePass). If it really
is meant to use the password, I don't think it should be printed in clear
text, and (in my mind) the wording should be clearer.

The second issue is that slurmctld.service won't start. The last error shown
above

fatal: You are running with a database but for some reason we have less
TRES than should be here (4 < 5) and/or the "billing" TRES is missing. This
should only happen if the database is down after an upgrade.

has a couple of hits on Google: an unanswered email from January
https://groups.google.com/d/msg/slurm-users/iZsSVlqQAyE/rKiSWihyEQAJ

and a bug report
https://bugs.schedmd.com/show_bug.cgi?id=4579

which seems to have resolved a similar but slightly different problem. The
fix suggested in that bug report doesn't work for me: using MariaDB_server
5.2.x, my tres_table didn't have a gres entry in it anyway.

+---------------+---------+------+----------------+------+
| creation_time | deleted | id   | type           | name |
+---------------+---------+------+----------------+------+
|    1527744028 |       0 |    1 | cpu            |      |
|    1527744028 |       0 |    2 | mem            |      |
|    1527744028 |       0 |    3 | energy         |      |
|    1527744028 |       0 |    4 | node           |      |
|    1527744028 |       0 |    5 | billing        |      |
|    1527744028 |       1 | 1000 | dynamic_offset |      |
+---------------+---------+------+----------------+------+
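
For what it's worth, here is a minimal sketch for checking a pasted tres_table dump against the TRES types I would expect to be present. The expected list (cpu, mem, energy, node, billing) is an assumption taken from the table above and the "4 < 5" in the fatal message, not from the Slurm source, and the helper names are made up for illustration.

```python
# Table output as printed by the mysql client (copied from above).
TABLE = """\
+---------------+---------+------+----------------+------+
| creation_time | deleted | id   | type           | name |
+---------------+---------+------+----------------+------+
|    1527744028 |       0 |    1 | cpu            |      |
|    1527744028 |       0 |    2 | mem            |      |
|    1527744028 |       0 |    3 | energy         |      |
|    1527744028 |       0 |    4 | node           |      |
|    1527744028 |       0 |    5 | billing        |      |
|    1527744028 |       1 | 1000 | dynamic_offset |      |
+---------------+---------+------+----------------+------+
"""

# Assumed set of static TRES; inferred from this thread, not authoritative.
EXPECTED = ("cpu", "mem", "energy", "node", "billing")

def parse_mysql_table(text):
    """Turn the ASCII table into a list of dicts keyed by column name."""
    rows = [ln.strip() for ln in text.splitlines() if ln.strip().startswith("|")]
    cells = [[c.strip() for c in ln.strip("|").split("|")] for ln in rows]
    header, body = cells[0], cells[1:]
    return [dict(zip(header, row)) for row in body]

def missing_tres(rows, expected=EXPECTED):
    """List expected TRES types that have no non-deleted row."""
    active = {r["type"] for r in rows if r["deleted"] == "0"}
    return [t for t in expected if t not in active]

print(missing_tres(parse_mysql_table(TABLE)))  # → []
```

Run against the table above it reports nothing missing: all five types, including billing, are present and not deleted, so the table itself looks complete.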


No idea what to try next. Any hints would be appreciated.

Running on CentOS 7.5, upgrading from 17.02.8. (I dropped the slurmdbd
db and restarted it from empty when the fix from the bug report didn't work.)

L.

------
"The antidote to apocalypticism is apocalyptic civics. Apocalyptic civics
is the insistence that we cannot ignore the truth, nor should we panic
about it. It is a shared consciousness that our institutions have failed
and our ecosystem is collapsing, yet we are still here — and we are
creative agents who can shape our destinies. Apocalyptic civics is the
conviction that the only way out is through, and the only way through is
together. "

Greg Bloom @greggish https://twitter.com/greggish/status/873177525903609857