[slurm-users] missing/failed mem_req conversion when upgrading from 15.08.12 to 17.02.6
Jacek Budzowski
j.budzowski at cyfronet.pl
Fri Jan 26 05:08:26 MST 2018
Hello,
Thank you for this explanation.
We also ran into this problem (when migrating from 16.05.4 to
17.02.1) and finally fixed it by running the following database query:
UPDATE slurm_acct_db.prometheus_job_table SET
mem_req=IF(mem_req&0x80000000,(mem_req&0x7fffffff)|0x8000000000000000,mem_req);
where "prometheus" is our cluster name.
This query performs the modifications you described.
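
For anyone who wants to check the bit arithmetic before touching the
database, here is a minimal Python sketch of what the IF(...) expression
does to a single value. The constant and function names are made up for
illustration (they are not Slurm identifiers), and the size comparison at
the end assumes mem_req is stored in MB, which is an assumption rather
than something stated in this thread.

```python
# Hypothetical names for illustration; these are not Slurm identifiers.
OLD_FLAG = 0x80000000          # 2^31: per-CPU marker in the old 32-bit column
NEW_FLAG = 0x8000000000000000  # 2^63: per-CPU marker in the new 64-bit column
LOW_31_BITS = 0x7FFFFFFF       # mask that keeps only the memory amount


def convert_mem_req(mem_req: int) -> int:
    """Mirror IF(mem_req & OLD_FLAG, (mem_req & LOW_31_BITS) | NEW_FLAG, mem_req)."""
    if mem_req & OLD_FLAG:
        return (mem_req & LOW_31_BITS) | NEW_FLAG
    return mem_req


def is_unconverted(mem_req: int) -> bool:
    """True for values in the range 2^31 <= mem_req < 2^63."""
    return OLD_FLAG <= mem_req < NEW_FLAG


if __name__ == "__main__":
    old_value = OLD_FLAG | 4096        # an amount of 4096, flagged the old way
    new_value = convert_mem_req(old_value)
    print(hex(old_value), "->", hex(new_value))
    # Read without conversion, the old flag bit counts as part of the amount.
    # Assuming the unit is MB, that is about 2 PB instead of 4 GB.
    print(old_value / 2**30, "PB (misread) vs",
          (new_value & ~NEW_FLAG) / 2**10, "GB")
```

Running the function a second time on an already converted value leaves
it unchanged, because the 2^31 bit is clear after the first pass. That
suggests the query can be re-run safely, but a backup of the table
beforehand still seems prudent.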
Regards,
Jacek
On 26.01.2018 at 11:59, Lech Nieroda wrote:
> Dear slurm users,
>
> we ran into a problem after upgrading from Slurm 15.08.12 to
> 17.02.6 back in August 2017: all old jobs that had requested memory
> with the 'mem-per-cpu' option showed absurd values in the 'reqmem'
> attribute when queried with sacct.
> The values were somewhere in the petabyte range, whereas they should
> have been in the gigabyte range.
>
> An analysis of the issue has shown the following:
> The attribute corresponding to 'reqmem' in the database is 'mem_req'
> in the 'cheops_job_table' table. It stores both 'mem' and
> 'mem-per-cpu' values: the 'mem' value is stored directly and the
> 'mem-per-cpu' value is stored with a certain flag (bit) set.
> In Slurm 15.08.12 the 'mem_req' attribute is a simple int (32 bit) and
> the flag is the 32nd bit.
> In Slurm 17.02.6 the 'mem_req' attribute is a bigint (64 bit) and the
> flag is the 64th bit.
> Thus the 'mem-per-cpu' values with 2^31 "added" to them appeared as
> petabytes.
>
> The uint32_t -> uint64_t change took place with the commit of
> 2016-06-27, with the annotation that it requires further "table
> conversion logic to MySQL, as mem_req column needs to change type to
> 'bigint unsigned' from 'int unsigned'.".
> I don't know if this work has been done, but when we upgraded Slurm
> and the database was converted automatically, the values were not
> corrected and there was no error concerning this issue.
>
> In case you have run into something similar, the fix is simple: we
> converted the values 'manually', i.e. wrote a query that selected all
> entries with 2^31 <= mem_req < 2^63, made a backup, cleared the 2^31
> bit, set the 2^63 bit, stored and checked the values.
>
>
> Regards,
> Lech
>
> --
> Dipl.-Wirt.-Inf. Lech Nieroda
> Regionales Rechenzentrum der Universität zu Köln (RRZK)
--
Jacek Budzowski
System administrator
ACC Cyfronet AGH