[slurm-users] Upgrade woes
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Thu May 31 01:00:00 MDT 2018
Hi Lachlan,
Slurm upgrades on CentOS 7.5 should run without problems. It seems to
me that your problems are unrelated to the Slurm RPMs. FWIW, I have
documented the Munge and Slurm installation and upgrade processes on
my Wiki page: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation
Hope this helps.
/Ole
On 05/31/2018 07:39 AM, Lachlan Musicman wrote:
> After last night's announcement, I decided to start the upgrade process.
>
> Build went fine - once I worked out where munge went - and installation
> also seemed fine.
>
> slurmctld won't restart though.
>
> In the logs I'm seeing:
>
> [2018-05-31T15:20:50.810] debug: Munge encode failed: Failed to access
> "xxxxxxxx": No such file or directory (retrying ...)
> [2018-05-31T15:20:50.824] debug: Recovered 4 tres
> [2018-05-31T15:20:50.825] debug: Recovered 3 users
> [2018-05-31T15:20:50.825] debug: Recovered 0 resources
> [2018-05-31T15:20:50.825] debug: Recovered 1 qos
> [2018-05-31T15:20:50.825] debug: Recovered 8 associations
> [2018-05-31T15:20:50.872] fatal: You are running with a database but for
> some reason we have less TRES than should be here (4 < 5) and/or the
> "billing" TRES is missing. This should only happen if the database is
> down after an upgrade.
>
> The first issue is that
>
> debug: Munge encode failed: Failed to access "xxxxxx": No such file or
> directory (retrying ...)
>
> contains the password in clear text ("xxxxx"). This is doubly confusing
> - "failed to access" would indicate it meant to have the database name
> (StorageLoc) rather than the database password (StoragePass). If it is
> meant to be using the password, I don't think it should be clear text
> and (in my mind) the language should be clearer.
>
> The second issue is that slurmctld.service won't start. The last error
> shown above
>
> fatal: You are running with a database but for some reason we have less
> TRES than should be here (4 < 5) and/or the "billing" TRES is missing.
> This should only happen if the database is down after an upgrade.
>
> has a couple of hits on Google: an unanswered email from January
> https://groups.google.com/d/msg/slurm-users/iZsSVlqQAyE/rKiSWihyEQAJ
>
> and a bug report
> https://bugs.schedmd.com/show_bug.cgi?id=4579
>
> which seems to have solved a slightly different but similar problem. The
> fix suggested in that bug report doesn't work: using MariaDB_server
> 5.2.x my tres_table didn't have gres in it anyway.
>
> +---------------+---------+------+----------------+------+
> | creation_time | deleted | id   | type           | name |
> +---------------+---------+------+----------------+------+
> |    1527744028 |       0 |    1 | cpu            |      |
> |    1527744028 |       0 |    2 | mem            |      |
> |    1527744028 |       0 |    3 | energy         |      |
> |    1527744028 |       0 |    4 | node           |      |
> |    1527744028 |       0 |    5 | billing        |      |
> |    1527744028 |       1 | 1000 | dynamic_offset |      |
> +---------------+---------+------+----------------+------+
>
>
> No idea what to try next. Any hints would be appreciated.
>
> Running on CentOS 7.5, upgrading from 17.02.8 (and I dropped the
> slurmdbd db and restarted it from empty when the fix from the bug
> report didn't work).