[slurm-users] Upgrade woes

Thu May 31 01:00:00 MDT 2018

Hi Lachlan,

Slurm upgrades on CentOS 7.5 should run without problems.  It seems to 
me that your problems are unrelated to the Slurm RPMs.  FWIW, I 
documented the Munge and Slurm installation as well as the upgrade 
process on my Wiki page https://wiki.fysik.dtu.dk/niflheim/Slurm_installation

Hope this helps.
/Ole

On 05/31/2018 07:39 AM, Lachlan Musicman wrote:
> After last night's announcement, I decided to start the upgrade process.
> 
> Build went fine - once I worked out where munge went - and installation 
> also seemed fine.
> 
> slurmctld won't restart though.
> 
> In the logs I'm seeing:
> 
> [2018-05-31T15:20:50.810] debug:  Munge encode failed: Failed to access 
> "xxxxxxxx": No such file or directory (retrying ...)
> [2018-05-31T15:20:50.824] debug:  Recovered 4 tres
> [2018-05-31T15:20:50.825] debug:  Recovered 3 users
> [2018-05-31T15:20:50.825] debug:  Recovered 0 resources
> [2018-05-31T15:20:50.825] debug:  Recovered 1 qos
> [2018-05-31T15:20:50.825] debug:  Recovered 8 associations
> [2018-05-31T15:20:50.872] fatal: You are running with a database but for 
> some reason we have less TRES than should be here (4 < 5) and/or the 
> "billing" TRES is missing. This should only happen if the database is 
> down after an upgrade.
> 
> The first issue is that
> 
> debug:  Munge encode failed: Failed to access "xxxxxx": No such file or 
> directory (retrying ...)
> 
> contains the password in clear text ("xxxxx"). This is doubly confusing 
> - "failed to access" would indicate it meant to have the database name 
> (StorageLoc) rather than the database password (StoragePass). If it is 
> meant to be using the password, I don't think it should be clear text 
> and (in my mind) the language should be clearer.
> 
> The second issue is that slurmctld.service won't start. The last error 
> shown above
> 
> fatal: You are running with a database but for some reason we have less 
> TRES than should be here (4 < 5) and/or the "billing" TRES is missing. 
> This should only happen if the database is down after an upgrade.
> 
> Has a couple of hits in Google - an unanswered email from January
> https://groups.google.com/d/msg/slurm-users/iZsSVlqQAyE/rKiSWihyEQAJ
> 
> and a bug report
> https://bugs.schedmd.com/show_bug.cgi?id=4579
> 
> which seems to have solved a slightly different but similar problem. The 
> fix suggested in that bug report doesn't work: using MariaDB_server 
> 5.2.x my tres_table didn't have gres in it anyway.
> 
> +---------------+---------+------+----------------+------+
> | creation_time | deleted | id   | type           | name |
> +---------------+---------+------+----------------+------+
> |    1527744028 |       0 |    1 | cpu            |      |
> |    1527744028 |       0 |    2 | mem            |      |
> |    1527744028 |       0 |    3 | energy         |      |
> |    1527744028 |       0 |    4 | node           |      |
> |    1527744028 |       0 |    5 | billing        |      |
> |    1527744028 |       1 | 1000 | dynamic_offset |      |
> +---------------+---------+------+----------------+------+
> 
> 
> No idea what to try next. Any hints would be appreciated.
> 
> Running on CentOS 7.5, upgrading from 17.02.8 (and I dropped the 
> slurmdbd db and restarted it from empty when the fix from the bug 
> report didn't work)
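
For anyone comparing their own tres_table against the one quoted above, here is a minimal Python sketch that parses the pasted MySQL table output and reports which of the five TRES types shown there are absent or marked deleted. The names REQUIRED_TRES, parse_table and missing_tres are made up for illustration, and the required set is simply read off the quoted table; the real check inside slurmctld may well differ.

```python
# Sketch only: REQUIRED_TRES is taken from the table quoted in this thread,
# not from the Slurm source. The helper names are hypothetical.
REQUIRED_TRES = {"cpu", "mem", "energy", "node", "billing"}

TABLE = """
+---------------+---------+------+----------------+------+
| creation_time | deleted | id   | type           | name |
+---------------+---------+------+----------------+------+
|    1527744028 |       0 |    1 | cpu            |      |
|    1527744028 |       0 |    2 | mem            |      |
|    1527744028 |       0 |    3 | energy         |      |
|    1527744028 |       0 |    4 | node           |      |
|    1527744028 |       0 |    5 | billing        |      |
|    1527744028 |       1 | 1000 | dynamic_offset |      |
+---------------+---------+------+----------------+------+
"""


def parse_table(text):
    """Turn MySQL-style ASCII table output into a list of dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip blank lines and +----+ separators
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first data-like line holds column names
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def missing_tres(rows):
    """Return the required TRES types that have no non-deleted row."""
    present = {row["type"] for row in rows if row["deleted"] == "0"}
    return sorted(REQUIRED_TRES - present)


if __name__ == "__main__":
    missing = missing_tres(parse_table(TABLE))
    print("missing TRES:", missing if missing else "none")
```

For the table in this thread the sketch finds nothing missing, which matches the observation that the database itself already has the "billing" row.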