[slurm-users] Providing users with info on wait time vs. run time
Hermann Schwärzler
hermann.schwaerzler at uibk.ac.at
Thu Sep 15 14:08:49 UTC 2022
Hi Loris,
we try to achieve the same (I guess) - which is nudging the users in the
direction of using scarce resources carefully - by using goslmailer
(https://github.com/CLIP-HPC/goslmailer) and a (not yet published - see
https://github.com/CLIP-HPC/goslmailer/issues/20) custom connector to
write a summary of every job to a file (next to the out-file).
Among other things, goslmailer gives hints on how to optimize future jobs
(e.g. "+ TIP: Please consider lowering the amount of requested CPU
cores in the future, your job has consumed less than half of the
requested CPU cores").
We do not yet use this in production but we will soon.
Just out of curiosity: how do you insert the output of seff into the
out-file of a job?
Regards,
Hermann
On 9/15/22 12:02 PM, Loris Bennett wrote:
> Hi,
>
> Today I spotted a job which requested an entire node, then had to wait
> for around 16 hours and finally ran, apparently successfully, for less
> than 4 minutes.
>
> As it currently seems in general fashionable for users round here to
> request the maximum number of cores available on a node without doing
> any scaling experiments or considering backfill, it seems like it would
> be a good idea to provide them with some feedback on wait/run times.
>
> One option would be to write the information into the Slurm 'out' file
> (currently we insert the output of 'seff'). Another option would be to
> aggregate the times over, say, a month and provide the absolute totals
> and maybe a run-to-wait ratio.
>
> Has anyone already done anything like this?
>
> Cheers,
>
> Loris
>
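The monthly aggregation Loris describes could be sketched roughly as
follows. This is only a minimal sketch, not something either site runs:
it assumes the accounting data has been dumped with something like
"sacct -a -X -n -P -S 2022-09-01 -E 2022-10-01 -o User,Submit,Start,End"
(pipe-separated, default ISO timestamps), and the function names
parse_time and summarize are made up for illustration. Wait time is
taken as Start minus Submit, which ignores holds and dependencies.

```python
from collections import defaultdict
from datetime import datetime


def parse_time(value):
    """Parse a sacct timestamp; return None for 'Unknown', 'None', etc."""
    try:
        return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
    except ValueError:
        return None


def summarize(lines):
    """Sum wait and run seconds per user from 'User|Submit|Start|End' lines.

    Assumption: one line per job allocation, as produced by
    'sacct -X -n -P -o User,Submit,Start,End'.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # user -> [wait_s, run_s]
    for line in lines:
        parts = line.strip().split("|")
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        user, submit, start, end = parts
        t_submit, t_start, t_end = (parse_time(p) for p in (submit, start, end))
        if t_submit is None or t_start is None or t_end is None:
            continue  # job never started or is still running
        totals[user][0] += (t_start - t_submit).total_seconds()
        totals[user][1] += (t_end - t_start).total_seconds()
    return {
        user: {
            "wait_s": wait,
            "run_s": run,
            # Ratio is undefined if the user never waited at all.
            "run_to_wait": run / wait if wait > 0 else None,
        }
        for user, (wait, run) in totals.items()
    }


if __name__ == "__main__":
    # Hypothetical sample resembling the job above: 16 h wait, 4 min run.
    sample = [
        "alice|2022-09-14T10:00:00|2022-09-15T02:00:00|2022-09-15T02:04:00",
        "bob|2022-09-14T10:00:00|Unknown|Unknown",
    ]
    for user, stats in summarize(sample).items():
        print(user, stats)
```

A per-user run-to-wait ratio well below 1, as in the sample, would be
the kind of number worth mailing out once a month.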