[slurm-users] missing/failed mem_req conversion when upgrading from 15.08.12 to 17.02.6

Jacek Budzowski j.budzowski at cyfronet.pl
Fri Jan 26 05:08:26 MST 2018


Hello,

Thank you for this explanation.

We also ran into this problem (when migrating from 16.05.4 to 
17.02.1) and eventually fixed it by running the following database query:

UPDATE slurm_acct_db.prometheus_job_table SET 
mem_req=IF(mem_req&0x80000000,(mem_req&0x7fffffff)|0x8000000000000000,mem_req);

where "prometheus" is our cluster name.

This query performs the modifications you described: for every row with 
the 2^31 bit set, it clears that bit and sets the 2^63 bit.
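
For anyone who wants to check the bit arithmetic before touching the 
database, here is a minimal Python sketch of the same per-row 
transformation. The names convert_mem_req and decode_mem_req are made up 
for this illustration and are not part of Slurm, and the megabyte unit 
is an assumption on my side (it is consistent with the PetaByte-range 
values reported below).

```python
# Illustrative sketch only; helper names are hypothetical, not Slurm code.
OLD_FLAG = 0x80000000           # bit 31: per-CPU flag in the 32-bit column
NEW_FLAG = 0x8000000000000000   # bit 63: per-CPU flag in the 64-bit column


def convert_mem_req(mem_req: int) -> int:
    """Mirror of the SQL IF(...) expression used in the UPDATE above."""
    if mem_req & OLD_FLAG:
        # Keep the low 31 bits (the memory value) and set the new flag.
        return (mem_req & 0x7FFFFFFF) | NEW_FLAG
    # Plain 'mem' values and already converted rows stay unchanged.
    return mem_req


def decode_mem_req(mem_req: int) -> tuple[int, bool]:
    """Split a 64-bit mem_req into (memory value, per-CPU flag)."""
    return mem_req & ~NEW_FLAG, bool(mem_req & NEW_FLAG)


if __name__ == "__main__":
    old = OLD_FLAG | 4096            # 4096 per CPU, stored the old way
    # Read with the new layout, the old flag bit counts as memory.
    # Assuming megabytes, 2**31 MB is about 2 PiB.
    wrong_value, wrong_flag = decode_mem_req(old)
    print("unconverted:", wrong_value, "per-cpu" if wrong_flag else "per-node")
    value, flag = decode_mem_req(convert_mem_req(old))
    print("converted:  ", value, "per-cpu" if flag else "per-node")
```

Running the conversion twice gives the same result as running it once, 
so the query should be safe to repeat on a table that has already been 
partially fixed. A backup beforehand is still a good idea.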

Regards,
Jacek

On 26.01.2018 at 11:59, Lech Nieroda wrote:
> Dear slurm users,
>
> we ran into a problem after upgrading from slurm 15.08.12 to 
> 17.02.6 back in August 2017: all old jobs that had requested memory 
> with the 'mem-per-cpu' option showed absurd values in the 'reqmem' 
> attribute when queried with sacct.
> The values were somewhere in the PetaByte range, whereas they should 
> have been in the GigaByte range.
>
> An analysis of the issue has shown the following:
> The attribute corresponding to 'reqmem' in the database is 'mem_req' 
> in the 'cheops_job_table' table. It stores both 'mem' and 
> 'mem-per-cpu' values: the 'mem' value is stored directly and the 
> 'mem-per-cpu' value is stored with a certain flag (bit) set.
> In slurm 15.08.12 the 'mem_req' attribute is a simple int (32bit) and 
> the flag is the 32nd bit (2^31).
> In slurm 17.02.6 the 'mem_req' attribute is a bigint (64bit) and the 
> flag is the 64th bit (2^63).
> Thus the 'mem-per-cpu' values with 2^31 "added" to them appeared as 
> PetaBytes.
>
> The uint32_t -> uint64_t change took place with the commit of 
> 2016-06-27, with the annotation that it requires further "table 
> conversion logic to MySQL, as mem_req column needs to change type to 
> 'bigint unsigned' from 'int unsigned'."
> I don’t know whether this work has been done, but when we upgraded 
> slurm and the database was converted automatically, the values were 
> not corrected and there was no error concerning this issue.
>
> In case you have run into something similar, the fix is simple: we 
> converted the values 'manually', i.e. made a query that selected all 
> entries with 2^31 <= mem_req < 2^63, made a backup, cleared the 2^31 
> bit, set the 2^63 bit, stored and checked the values.
>
>
> Regards,
> Lech
>
> --
> Dipl.-Wirt.-Inf. Lech Nieroda
> Regionales Rechenzentrum der Universität zu Köln (RRZK)

-- 
Jacek Budzowski
System administrator
ACC Cyfronet AGH



