[slurm-users] Upgrade woes

Thu May 31 05:29:07 MDT 2018

Hi,

I haven't done this upgrade, but if I were you I would start by
verifying the simple things.

Is munge working independently of slurm? (There is a munge encode/decode
command line floating around on the slurm web page for testing.) Is munge
looking for the keys in the right place, AND is it happy with all the
directory and file permissions? It is pickier than anything else I have
encountered in that respect. Check the munge startup logs, and confirm it is
running and restarts OK on all the boxes involved.
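
In case it helps, here is a rough Python sketch of both munge checks. It
assumes the munge and unmunge binaries are on your PATH and that the
round trip is the usual "munge -n | unmunge"; the function names are my
own, and the key path in the usage note is only the common default, so
adjust it to your build.

```python
import os
import stat
import subprocess


def munge_roundtrip_ok(encode_cmd=("munge", "-n"), decode_cmd=("unmunge",)):
    """Return True if a credential can be encoded and then decoded locally."""
    try:
        encoded = subprocess.run(encode_cmd, capture_output=True, check=True)
        subprocess.run(decode_cmd, input=encoded.stdout,
                       capture_output=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        # A missing binary, a stopped daemon, or a bad key all end up here.
        return False
    return True


def private_to_owner(path):
    """Return True if `path` grants no permissions to group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0
```

Run munge_roundtrip_ok() on every box involved, and try something like
private_to_owner("/etc/munge/munge.key") as root (path assumed, not
taken from your setup). A False from either one is worth chasing before
looking at slurm itself.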

I know that past upgrades which required a database schema upgrade
needed the upgraded slurmdbd to be started first to do the update, which
can take a while (minutes), and only then slurmctld. Check the slurmctld
logs for a start-up OK and/or schema update complete message.
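
If the logs are long, a small filter can pull out just the serious
lines. This is only a sketch: it assumes each log line has the
"[timestamp] level: message" shape seen in your excerpt, and the
function name is made up for illustration.

```python
import re

# Matches the "[timestamp] level: message" shape seen in the quoted log.
LOG_LINE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>\w+):\s+(?P<msg>.*)$")


def lines_at_level(log_text, levels=("fatal", "error")):
    """Return (timestamp, level, message) tuples for the given levels."""
    hits = []
    for line in log_text.splitlines():
        m = LOG_LINE.match(line)
        if m and m.group("level") in levels:
            hits.append((m.group("ts"), m.group("level"), m.group("msg")))
    return hits
```

Feed it the contents of both the slurmdbd and slurmctld logs and compare
timestamps to see which daemon complained first.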

Since you said you built it yourself and upgraded, double-check that the
slurm configuration is being read from where you expect and that those
files are valid and didn't get overwritten or something. You might even
verify that the same munge is being found as before. Increase debug and
start things with strace, looking for open calls...
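
strace output gets noisy fast, so here is a sketch of a filter that
keeps only the opens that failed. It assumes you saved the trace to a
file (for example with strace's -o option while tracing open and openat)
and that the lines have the usual strace layout; the function name is
hypothetical.

```python
import re

# Shape of a failed open in strace output, for example:
#   openat(AT_FDCWD, "/some/path", O_RDONLY) = -1 ENOENT (No such file or directory)
FAILED_OPEN = re.compile(
    r'\bopen(?:at)?\((?:[^,"]+,\s*)?"(?P<path>[^"]*)".*\)\s*=\s*-1\s+(?P<errno>E[A-Z]+)'
)


def failed_opens(strace_text):
    """Return (path, errno_name) pairs for open/openat calls that failed."""
    hits = []
    for line in strace_text.splitlines():
        m = FAILED_OPEN.search(line)
        if m:
            hits.append((m.group("path"), m.group("errno")))
    return hits
```

Plenty of ENOENT hits are harmless library search-path probes, so look
specifically for config files, the munge socket, or key paths in the
result.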

Verify DB connectivity using the information in the config file - if you
specify a host, user, password and database, check that connecting with
those works using the mysql client from the command line.
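
To avoid typos when copying values across, something like this sketch
can read the Storage* settings and build the matching mysql command. It
assumes a plain Key=Value file with '#' comments (it does not handle an
escaped '#' inside a value), and the function names are mine.

```python
def parse_conf(text):
    """Parse simple Key=Value lines, skipping blanks and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Lower-case the keys so lookups do not depend on spelling case.
        settings[key.strip().lower()] = value.strip()
    return settings


def mysql_argv(settings):
    """Build a mysql client command from the Storage* settings.

    The bare -p makes the client prompt for the password, so StoragePass
    never appears in the process list or shell history.
    """
    argv = ["mysql", "-p"]
    if "storagehost" in settings:
        argv += ["-h", settings["storagehost"]]
    if "storageport" in settings:
        argv += ["-P", settings["storageport"]]
    if "storageuser" in settings:
        argv += ["-u", settings["storageuser"]]
    if "storageloc" in settings:
        argv.append(settings["storageloc"])
    return argv
```

Print " ".join(mysql_argv(parse_conf(open(path).read()))) for your
slurmdbd config file, run the result by hand, and type the StoragePass
value at the prompt. If that login fails, slurmdbd has no chance either.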

Just some random things to try above. I hope you get it working! :)

Best,
Chris

On Thu, May 31, 2018, 01:42 Lachlan Musicman <datakid at gmail.com> wrote:

> After last night's announcement, I decided to start the upgrade process.
>
> Build went fine - once I worked out where munge went - and installation
> also seemed fine.
>
> slurmctld won't restart though.
>
> In the logs I'm seeing:
>
> [2018-05-31T15:20:50.810] debug:  Munge encode failed: Failed to access
> "xxxxxxxx": No such file or directory (retrying ...)
> [2018-05-31T15:20:50.824] debug:  Recovered 4 tres
> [2018-05-31T15:20:50.825] debug:  Recovered 3 users
> [2018-05-31T15:20:50.825] debug:  Recovered 0 resources
> [2018-05-31T15:20:50.825] debug:  Recovered 1 qos
> [2018-05-31T15:20:50.825] debug:  Recovered 8 associations
> [2018-05-31T15:20:50.872] fatal: You are running with a database but for
> some reason we have less TRES than should be here (4 < 5) and/or the
> "billing" TRES is missing. This should only happen if the database is down
> after an upgrade.
>
> The first issue is that
>
> debug:  Munge encode failed: Failed to access "xxxxxx": No such file or
> directory (retrying ...)
>
> contains the password in clear text ("xxxxx"). This is doubly confusing -
> "failed to access" would indicate it meant to have the database name
> (StorageLoc) rather than the database password (StoragePass). If it is
> meant to be using the password, I don't think it should be clear text and
> (in my mind) the language should be clearer.
>
> The second issue is that slurmctld.service won't start. The last error
> shown above
>
> fatal: You are running with a database but for some reason we have less
> TRES than should be here (4 < 5) and/or the "billing" TRES is missing. This
> should only happen if the database is down after an upgrade.
>
> Has a couple of hits in Google - an unanswered email from January
> https://groups.google.com/d/msg/slurm-users/iZsSVlqQAyE/rKiSWihyEQAJ
>
> and a bug report
> https://bugs.schedmd.com/show_bug.cgi?id=4579
>
> which seems to have solved a slightly different but similar problem. The
> fix suggested in that bug report doesn't work: using MariaDB_server 5.2.x
> my tres_table didn't have gres in it anyway.
>
> +---------------+---------+------+----------------+------+
> | creation_time | deleted | id   | type           | name |
> +---------------+---------+------+----------------+------+
> |    1527744028 |       0 |    1 | cpu            |      |
> |    1527744028 |       0 |    2 | mem            |      |
> |    1527744028 |       0 |    3 | energy         |      |
> |    1527744028 |       0 |    4 | node           |      |
> |    1527744028 |       0 |    5 | billing        |      |
> |    1527744028 |       1 | 1000 | dynamic_offset |      |
> +---------------+---------+------+----------------+------+
>
>
> No idea what to try next. Any hints would be appreciated.
>
> Running on CentOS 7.5, upgrading from 17.02.8 (and I dropped the slurmdbd
> db and restarted it from empty when the bug report didn't work)
>
> L.
>
> ------
> "The antidote to apocalypticism is apocalyptic civics. Apocalyptic civics
> is the insistence that we cannot ignore the truth, nor should we panic
> about it. It is a shared consciousness that our institutions have failed
> and our ecosystem is collapsing, yet we are still here — and we are
> creative agents who can shape our destinies. Apocalyptic civics is the
> conviction that the only way out is through, and the only way through is
> together. "
>
> Greg Bloom @greggish
> https://twitter.com/greggish/status/873177525903609857
>
-- 
Chris Harwell