[slurm-users] big increase of MaxStepCount?
remi at rackslab.io
Wed Jan 19 08:16:23 UTC 2022
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Wednesday, January 12, 2022 at 18:45, John R Anderson <jra at unr.edu> wrote:
> Hello, a user has requested that we set MaxStepCount to "unlimited" or 16 million to accommodate some of their desired workflows. I searched around for details about this parameter and don't see a lot, and I reviewed https://bugs.schedmd.com/show_bug.cgi?id=5722
> Any thoughts on this? Can this successfully be applied to a partition or individual nodes only? I wonder about log files exploding or worse...
I think one bottleneck here could be accounting and SlurmDBD, if you are using it. Each step is one record in the step table of the SQL database. If you end up with hundreds of millions of records in that table, you might run into odd issues with, e.g., archiving or sreport. Keep in mind that Slurm major version upgrades may come with database schema changes, and at this order of magnitude the conversion could take a long time (several hours, say).
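To put rough numbers on this, here is a minimal Python sketch of the arithmetic, assuming only that one step is one record. The function name and all the figures (jobs per day, steps per job, retention period) are hypothetical placeholders, not measurements from any real cluster; plug in your own.

```python
# Back-of-envelope estimate of how many step records accumulate in the
# accounting database. All input figures are hypothetical placeholders.

def estimated_step_rows(jobs_per_day: int, steps_per_job: int, retention_days: int) -> int:
    """One step is one record, so rows = jobs/day * steps/job * days retained."""
    return jobs_per_day * steps_per_job * retention_days


if __name__ == "__main__":
    # Hypothetical workload: 10 jobs per day, each running up to the
    # requested 16 million steps, with records kept for 30 days.
    rows = estimated_step_rows(jobs_per_day=10, steps_per_job=16_000_000, retention_days=30)
    print(f"{rows:,} step records")
```

Even a handful of such jobs per day would go well past the hundreds of millions of records mentioned above, so it is worth doing this estimate before raising the limit.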
Given the total number of steps, I suspect this user may also generate a high throughput of steps. At some point, slurmctld might need some specific tuning to handle it gracefully.
Rackslab: Open Source Solutions for HPC Operations