[slurm-users] missing/failed mem_req conversion when upgrading from 15.08.12 to 17.02.6
lech.nieroda at uni-koeln.de
Fri Jan 26 03:59:28 MST 2018
Dear slurm users,
we ran into a problem after upgrading Slurm from 15.08.12 to 17.02.6 back in August 2017: all old jobs that had requested their memory with the 'mem-per-cpu' option showed absurd values in the 'reqmem' field when queried with sacct.
The values were in the petabyte range, whereas they should have been in the gigabyte range.
An analysis of the issue has shown the following:
The attribute corresponding to 'reqmem' in the database is 'mem_req' in the 'cheops_job_table' table. It stores both 'mem' and 'mem-per-cpu' values: a 'mem' value is stored directly, while a 'mem-per-cpu' value is stored with a flag bit set.
In Slurm 15.08.12 the 'mem_req' attribute is a plain 32-bit int and the flag is the 32nd (highest) bit, i.e. 2^31.
In Slurm 17.02.6 the 'mem_req' attribute is a 64-bit bigint and the flag is the 64th (highest) bit, i.e. 2^63.
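Thus the old 'mem-per-cpu' values, still carrying the 2^31 flag, were no longer recognized as per-CPU requests: 2^31 was effectively "added" to the plain value, and since Slurm counts memory in megabytes, 2^31 MB is roughly 2 petabytes.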
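To illustrate the misreading described above, here is a minimal Python sketch that decodes a raw mem_req value with either flag bit. It assumes that mem_req is stored in megabytes and that the flag is simply the top bit of the column; the helper name decode_mem_req is hypothetical and not part of Slurm.

```python
# Flag bits marking a per-CPU ('mem-per-cpu') request in mem_req.
FLAG_15_08 = 1 << 31  # top bit of the 32-bit column (Slurm 15.08)
FLAG_17_02 = 1 << 63  # top bit of the 64-bit column (Slurm 17.02)


def decode_mem_req(mem_req: int, flag: int) -> tuple[int, bool]:
    """Split a raw mem_req into (megabytes, is_per_cpu) using the given flag bit.

    Hypothetical helper for illustration; assumes mem_req is counted in MB.
    """
    is_per_cpu = bool(mem_req & flag)
    return mem_req & ~flag, is_per_cpu


if __name__ == "__main__":
    # A job that asked for 'mem-per-cpu' = 4096 MB under 15.08.
    raw = FLAG_15_08 | 4096

    print(decode_mem_req(raw, FLAG_15_08))  # (4096, True): what 15.08 meant
    print(decode_mem_req(raw, FLAG_17_02))  # (2147487744, False): unconverted row read by 17.02
    mb, _ = decode_mem_req(raw, FLAG_17_02)
    print(f"{mb / 2**30:.2f} PiB")  # 2.00 PiB (MB -> PiB)
```

Read with the 64-bit flag, the old flag bit is no longer masked off, so it turns a 4 GB per-CPU request into a per-node value of about 2 PiB.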
The uint32_t -> uint64_t change was made in a commit of 2016-06-27, with the annotation that it requires further "table conversion logic to MySQL, as mem_req column needs to change type to 'bigint unsigned' from 'int unsigned'."
I don't know whether this work has been done, but when we upgraded Slurm and the database was converted automatically, the values were not corrected and no error concerning this issue was reported.
In case you have run into something similar, the fix is simple: we converted the values 'manually', i.e. we ran a query that selected all entries with 2^31 <= mem_req < 2^63, made a backup, cleared the 2^31 bit, set the 2^63 bit, stored the results and checked the values.
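The manual fix above can be sketched in Python as pure functions over raw mem_req values. This is an illustration of the bit manipulation, not the query we actually ran; the names convert_mem_req and convert_rows are hypothetical, and a plain dict of job id -> mem_req stands in for the database table. Beyond the 2^31 <= mem_req < 2^63 range, the sketch also requires bit 31 to be set, so that a value such as 2^32 would be left alone instead of being flagged.

```python
OLD_FLAG = 1 << 31  # per-CPU flag in the 32-bit column (15.08)
NEW_FLAG = 1 << 63  # per-CPU flag in the 64-bit column (17.02)


def needs_conversion(mem_req: int) -> bool:
    """True for rows that still carry the old flag and not yet the new one."""
    return OLD_FLAG <= mem_req < NEW_FLAG and bool(mem_req & OLD_FLAG)


def convert_mem_req(mem_req: int) -> int:
    """Move the per-CPU flag from bit 31 to bit 63; other values pass through."""
    if not needs_conversion(mem_req):
        return mem_req
    return (mem_req & ~OLD_FLAG) | NEW_FLAG


def convert_rows(rows: dict[int, int]) -> tuple[dict[int, int], dict[int, int]]:
    """Return (backup, converted) for a mapping of job id -> raw mem_req.

    Only the affected rows appear in either result; the caller would store
    'converted' back and keep 'backup' in case anything goes wrong.
    """
    backup = {job: v for job, v in rows.items() if needs_conversion(v)}
    converted = {job: convert_mem_req(v) for job, v in backup.items()}

    # Check: each converted row has the new flag, lacks the old one,
    # and keeps the same megabyte amount as before.
    for job, v in converted.items():
        if not (v & NEW_FLAG) or (v & OLD_FLAG):
            raise ValueError(f"job {job}: flag bits wrong after conversion")
        if (v & ~NEW_FLAG) != (backup[job] & ~OLD_FLAG):
            raise ValueError(f"job {job}: memory amount changed")
    return backup, converted


if __name__ == "__main__":
    rows = {1: OLD_FLAG | 2048, 2: 8192, 3: NEW_FLAG | 1024}
    backup, converted = convert_rows(rows)
    print(backup)     # only job 1, with its old raw value
    print(converted)  # job 1 with the flag moved to bit 63
```

Keeping the conversion a pure function makes it easy to verify the backup and the converted values side by side before anything is written back.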
Dipl.-Wirt.-Inf. Lech Nieroda
Regionales Rechenzentrum der Universität zu Köln (RRZK)